Add logs for ensemble select failed #3779
@horizonzy Please help take a look at this PR, thanks.
Codecov Report
@@ Coverage Diff @@
## master #3779 +/- ##
=============================================
+ Coverage 58.20% 68.23% +10.03%
- Complexity 5605 6714 +1109
=============================================
Files 467 473 +6
Lines 40828 40955 +127
Branches 5234 5236 +2
=============================================
+ Hits 23762 27944 +4182
+ Misses 14961 10756 -4205
- Partials 2105 2255 +150
After #3721, RackawareEnsemblePlacementPolicyImpl#selectRandomFromRack may be invoked many times.
Logging here may be a little noisy.
@@ -675,6 +675,8 @@ protected BookieNode selectRandomFromRack(String netPath, Set<Node> excludeBooki
}
return bn;
}
LOG.warn("Failed to select bookie node from path: {}, leaves: {}, exclude Bookies: {}, ensemble: {}",
I wouldn't log that deep.
The first caller of this method is:
// first attempt to select one from local rack
try {
return selectRandomFromRack(networkLoc, excludeBookies, predicate, ensemble);
} catch (BKNotEnoughBookiesException e) {
/*
* there is no enough bookie from local rack, select bookies from
* the whole cluster and exclude the racks specified at
* <tt>excludeRacks</tt>.
*/
return selectFromNetworkLocation(excludeRacks, excludeBookies, predicate, ensemble, fallbackToRandom);
}
For this caller, it's not a WARN yet, since we may still choose a bookie from another rack, which satisfies rack awareness.
We can print INFO in the catch here.
2nd caller:
// select one from local rack
try {
return selectRandomFromRack(networkLoc, excludeBookies, predicate, ensemble);
} catch (BKNotEnoughBookiesException e) {
if (!fallbackToRandom) {
LOG.error(
"Failed to choose a bookie from {} : "
+ "excluded {}, enforceMinNumRacksPerWriteQuorum is enabled so giving up.",
networkLoc, excludeBookies);
throw e;
}
LOG.warn("Failed to choose a bookie from {} : "
+ "excluded {}, fallback to choose bookie randomly from the cluster.",
networkLoc, excludeBookies);
// randomly choose one from whole cluster, ignore the provided predicate.
return selectRandom(1, excludeBookies, predicate, ensemble).get(0);
}
Here we already log an ERROR or a WARN anyway, so there is no need to add any log here.
Since there are so many ways for this method to throw an exception, I would add a reason message at each throw, and include it in the logs printed by the callers that catch it.
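A minimal sketch of that idea, assuming a hypothetical exception type that accepts a reason string. The class, method, and bookie names below are illustrative and are not BookKeeper's actual API, the selection is a simple first-match rather than a random pick, and an in-memory list stands in for the logger:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are BookKeeper's real classes.
class ReasonedSelectionSketch {

    // Stand-in for a "not enough bookies" exception that carries a reason message.
    static class NotEnoughBookiesException extends Exception {
        NotEnoughBookiesException(String reason) {
            super(reason);
        }
    }

    // Records log lines in memory so the sketch needs no logging library.
    static final List<String> LOG_LINES = new ArrayList<>();

    // Each failure path throws with its own reason instead of logging deep inside.
    static String selectFromRack(String rack, List<String> bookiesInRack, Set<String> excluded)
            throws NotEnoughBookiesException {
        if (bookiesInRack.isEmpty()) {
            throw new NotEnoughBookiesException("no bookies registered under " + rack);
        }
        for (String bookie : bookiesInRack) {
            if (!excluded.contains(bookie)) {
                return bookie;
            }
        }
        throw new NotEnoughBookiesException(
                "all bookies under " + rack + " are excluded: " + excluded);
    }

    // The caller knows whether a fallback exists, so it picks the level and logs the reason.
    static String selectWithFallback(String rack, List<String> bookiesInRack,
                                     Set<String> excluded, String fallbackBookie) {
        try {
            return selectFromRack(rack, bookiesInRack, excluded);
        } catch (NotEnoughBookiesException e) {
            LOG_LINES.add("INFO Failed to choose a bookie from " + rack
                    + ": " + e.getMessage() + "; trying other racks");
            return fallbackBookie;
        }
    }

    public static void main(String[] args) {
        String chosen = selectWithFallback("/rack1", List.of("bk1"), Set.of("bk1"), "bk4");
        System.out.println(chosen);
        LOG_LINES.forEach(System.out::println);
    }
}
```

With this shape the deep method stays silent, and each caller decides on its own whether the failure deserves INFO, WARN, or ERROR while still reporting why the selection failed.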
@asafm Good point, thanks for your suggestion. I moved the log out of the method and put it in the catch block. Please help take a look again, thanks a lot.
What do you say about:
Since there are so many ways for this method to throw an exception, I would add a reason message at each throw, and include it in the logs printed by the callers that catch it.
Since selectRandomFromRack has several ways to fail and throw an exception, how about we include an error message in the exception and log it as well?
@horizonzy Thanks for your review, I moved the log out of the method and put it in the catch block.
LOG.warn("Failed to choose a bookie from {} : "
+ "excluded {}, fallback to choose bookie randomly from the cluster.",
networkLoc, excludeBookies);
LOG.warn("Failed to choose a bookie from {} : leaves {}, excluded bookies {}, "
The word "leaves" may confuse the user.
Failed to choose a bookie from {} : leaves -> Failed to choose a bookie from network location {}, the network location corresponding bookies {}
@@ -544,6 +544,9 @@ public BookieNode selectFromNetworkLocation(String networkLoc,
* the whole cluster and exclude the racks specified at
* <tt>excludeRacks</tt>.
*/
LOG.warn("Failed to choose a bookie node from {} : leaves {}, exclude Bookies {}, "
The word "leaves" may confuse the user.
Failed to choose a bookie from {} : leaves -> Failed to choose a bookie from network location {}, the network location corresponding bookies {}
Makes sense.
LGTM.
@@ -544,6 +545,10 @@ public BookieNode selectFromNetworkLocation(String networkLoc,
* the whole cluster and exclude the racks specified at
* <tt>excludeRacks</tt>.
*/
LOG.warn("Failed to choose a bookie node from network location {}, "
Don't you think this is actually an INFO here?
What if we have 5 racks? If we fail to choose from the same rack (say rack 1) and we can't choose from racks 2 and 3 since they already have other copies, we can still choose from rack 4, and that's OK, no?
Yes, it can still choose from rack 4. But if we only have 3 racks (rack1, rack2, and rack3) and we fail to choose any bookie node from rack1, it will randomly choose one from rack2 or rack3, which leads to 3 replicas located on 2 racks.
IMO, we'd better use the WARN level, because it will choose one bookie node from the whole bookie cluster randomly, with the risk that the chosen bookie is located on the same rack as an existing replica.
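To make the two scenarios in this thread concrete, here is a small sketch that classifies a fallback by whether any candidate rack is still free of replicas. The helper `levelForFallback` and the rack names are hypothetical and exist only for illustration; the PR itself logs at WARN unconditionally and does not implement this check:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: this helper is not part of BookKeeper.
class FallbackLevelSketch {

    // Returns "INFO" when at least one candidate rack holds no replica yet, so rack
    // awareness can still be satisfied; returns "WARN" when every candidate rack
    // already holds a replica, so the fallback must double up on a rack.
    static String levelForFallback(Set<String> racksWithReplicas, List<String> candidateRacks) {
        for (String rack : candidateRacks) {
            if (!racksWithReplicas.contains(rack)) {
                return "INFO";
            }
        }
        return "WARN";
    }

    public static void main(String[] args) {
        Set<String> replicas = Set.of("/rack2", "/rack3");

        // Five racks: the local /rack1 failed, but /rack4 and /rack5 are still free.
        System.out.println(levelForFallback(replicas,
                List.of("/rack2", "/rack3", "/rack4", "/rack5"))); // prints INFO

        // Three racks: the local /rack1 failed and only occupied racks remain.
        System.out.println(levelForFallback(replicas,
                List.of("/rack2", "/rack3"))); // prints WARN
    }
}
```

The first call matches the 5-rack case raised above, and the second matches the 3-rack case where three replicas end up on two racks. Because the random fallback in the PR does not know in advance which case it is in, a fixed WARN is the conservative choice.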
LGTM
### Motivation
We have 3 bookies in the same rack, configured with `E = 2, W = 2, A = 2`. When one bookie restarted, selecting a replacement bookie from the same rack failed during the ensemble change that replaces the failed bookie. Because `selectRandomFromRack` logs nothing, it is hard to debug the root cause.

https://github.com/apache/bookkeeper/blob/02e64a4b97e03afc9993ab227f82a5956965c03f/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L630-L679

### Modification
Add one warn log in `selectRandomFromRack` when selecting a new bookie node fails.

(cherry picked from commit 9ff2954)
Motivation
We have 3 bookies in the same rack, configured with E = 2, W = 2, A = 2. When one bookie restarted, selecting a replacement bookie from the same rack failed during the ensemble change that replaces the failed bookie. Because selectRandomFromRack logs nothing, it is hard to debug the root cause.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
Lines 630 to 679 in 02e64a4
Modification
Add one warn log in selectRandomFromRack when selecting a new bookie node fails.