Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logs for ensemble select failed #3779

Merged

Conversation

hangc0276
Copy link
Contributor

Motivation

We have 3 bookies in the same rack, and configured E = 2, W = 2, A = 2. When one bookie restarted, we found the ledger select bookie from the same rack failed in the ensemble change replacing the failed bookie step.

Due to there being no log information in selectRandomFromRack, it's hard to debug the root cause.

protected BookieNode selectRandomFromRack(String netPath, Set<Node> excludeBookies, Predicate<BookieNode> predicate,
Ensemble<BookieNode> ensemble) throws BKNotEnoughBookiesException {
WeightedRandomSelection<BookieNode> wRSelection = null;
List<Node> leaves = new ArrayList<Node>(topology.getLeaves(netPath));
if (!this.isWeighted) {
Collections.shuffle(leaves);
} else {
if (CollectionUtils.subtract(leaves, excludeBookies).size() < 1) {
throw new BKNotEnoughBookiesException();
}
wRSelection = prepareForWeightedSelection(leaves);
if (wRSelection == null) {
throw new BKNotEnoughBookiesException();
}
}
Iterator<Node> it = leaves.iterator();
Set<Node> bookiesSeenSoFar = new HashSet<Node>();
while (true) {
Node n;
if (isWeighted) {
if (bookiesSeenSoFar.size() == leaves.size()) {
// Don't loop infinitely.
break;
}
n = wRSelection.getNextRandom();
bookiesSeenSoFar.add(n);
} else {
if (it.hasNext()) {
n = it.next();
} else {
break;
}
}
if (excludeBookies.contains(n)) {
continue;
}
if (!(n instanceof BookieNode) || !predicate.apply((BookieNode) n, ensemble)) {
continue;
}
BookieNode bn = (BookieNode) n;
// got a good candidate
if (ensemble.addNode(bn)) {
// add the candidate to exclude set
excludeBookies.add(bn);
}
return bn;
}
throw new BKNotEnoughBookiesException();
}

Modification

Add one warn log in selectRandomFromRack when selecting a new bookie node failed.

@hangc0276
Copy link
Contributor Author

@horizonzy Please help take a look at this Pr, thanks.

@codecov-commenter
Copy link

codecov-commenter commented Feb 14, 2023

Codecov Report

Merging #3779 (2f22475) into master (b9a82c5) will increase coverage by 10.03%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##             master    #3779       +/-   ##
=============================================
+ Coverage     58.20%   68.23%   +10.03%     
- Complexity     5605     6714     +1109     
=============================================
  Files           467      473        +6     
  Lines         40828    40955      +127     
  Branches       5234     5236        +2     
=============================================
+ Hits          23762    27944     +4182     
+ Misses        14961    10756     -4205     
- Partials       2105     2255      +150     
Flag Coverage Δ
bookie 39.66% <ø> (?)
client 44.30% <ø> (+0.07%) ⬆️
remaining 29.66% <ø> (+0.15%) ⬆️
replication 41.36% <ø> (+0.07%) ⬆️
tls 20.99% <ø> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...r/client/RackawareEnsemblePlacementPolicyImpl.java 82.29% <ø> (+0.39%) ⬆️
...che/bookkeeper/meta/zk/ZKMetadataClientDriver.java 92.59% <0.00%> (-3.71%) ⬇️
.../apache/bookkeeper/proto/ReadEntryProcessorV3.java 62.50% <0.00%> (-2.09%) ⬇️
...pache/bookkeeper/replication/AutoRecoveryMain.java 39.74% <0.00%> (-1.93%) ⬇️
.../org/apache/bookkeeper/proto/BookieClientImpl.java 64.05% <0.00%> (-1.78%) ⬇️
...kkeeper/client/DefaultEnsemblePlacementPolicy.java 87.80% <0.00%> (-1.22%) ⬇️
...apache/bookkeeper/meta/ZkLedgerAuditorManager.java 71.69% <0.00%> (-0.95%) ⬇️
...pache/bookkeeper/meta/zk/ZKMetadataDriverBase.java 60.71% <0.00%> (-0.72%) ⬇️
...keeper/util/collections/ConcurrentOpenHashMap.java 90.41% <0.00%> (-0.69%) ⬇️
...pache/bookkeeper/meta/AbstractZkLedgerManager.java 55.87% <0.00%> (-0.64%) ⬇️
... and 128 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@horizonzy horizonzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #3721, It may invoke RackawareEnsemblePlacementPolicyImpl#selectRandomFromRack many times.
It may a little noise.

@@ -675,6 +675,8 @@ protected BookieNode selectRandomFromRack(String netPath, Set<Node> excludeBooki
}
return bn;
}
LOG.warn("Failed to select bookie node from path: {}, leaves: {}, exclude Bookies: {}, ensemble: {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't log that deep.

1st caller to this method is:

// first attempt to select one from local rack
        try {
            return selectRandomFromRack(networkLoc, excludeBookies, predicate, ensemble);
        } catch (BKNotEnoughBookiesException e) {
            /*
             * there is no enough bookie from local rack, select bookies from
             * the whole cluster and exclude the racks specified at
             * <tt>excludeRacks</tt>.
             */
            return selectFromNetworkLocation(excludeRacks, excludeBookies, predicate, ensemble, fallbackToRandom);
        }

For this caller, it's not a WARN yet, since we may still choose a bookie another rack which satisfies rack awareness.
We can print INFO in the catch here.

2nd caller:

        // select one from local rack
        try {
            return selectRandomFromRack(networkLoc, excludeBookies, predicate, ensemble);
        } catch (BKNotEnoughBookiesException e) {
            if (!fallbackToRandom) {
                LOG.error(
                        "Failed to choose a bookie from {} : "
                                + "excluded {}, enforceMinNumRacksPerWriteQuorum is enabled so giving up.",
                        networkLoc, excludeBookies);
                throw e;
            }
            LOG.warn("Failed to choose a bookie from {} : "
                     + "excluded {}, fallback to choose bookie randomly from the cluster.",
                     networkLoc, excludeBookies);
            // randomly choose one from whole cluster, ignore the provided predicate.
            return selectRandom(1, excludeBookies, predicate, ensemble).get(0);

Here we have the error or WARN anyway, so no need to add any log here.

Since there are so many ways to this method to throw an exception, I would add a reason message at each throw, and include it in the logs printed which receive it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asafm Good point, thanks for your suggestion. I moved the log out of the method and put it in the catch block. Please help take a look again, thanks a lot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you say about :

Since there are so many ways to this method to throw an exception, I would add a reason message at each throw, and include it in the logs printed which receive it.

Since selectRandomFromRack has several ways to fail and throw an exception, how about we include an error message in the exception and log it as well?

@hangc0276
Copy link
Contributor Author

After #3721, It may invoke RackawareEnsemblePlacementPolicyImpl#selectRandomFromRack many times. It may a little noise.

@horizonzy Thanks for your review, I moved the log out of the method and put it in the catch block

@hangc0276 hangc0276 requested review from asafm and horizonzy and removed request for asafm and horizonzy February 18, 2023 01:48
LOG.warn("Failed to choose a bookie from {} : "
+ "excluded {}, fallback to choose bookie randomly from the cluster.",
networkLoc, excludeBookies);
LOG.warn("Failed to choose a bookie from {} : leaves {}, excluded bookies {}, "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leaves may confuse the user.

Failed to choose a bookie from {} : leaves -> Failed to choose a bookie from network location {}, the network location corresponding bookies {}

@@ -544,6 +544,9 @@ public BookieNode selectFromNetworkLocation(String networkLoc,
* the whole cluster and exclude the racks specified at
* <tt>excludeRacks</tt>.
*/
LOG.warn("Failed to choose a bookie node from {} : leaves {}, exclude Bookies {}, "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leaves may confuse the user.

Failed to choose a bookie from {} : leaves -> Failed to choose a bookie from network location {}, the network location corresponding bookies {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense

Copy link
Member

@horizonzy horizonzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@@ -544,6 +545,10 @@ public BookieNode selectFromNetworkLocation(String networkLoc,
* the whole cluster and exclude the racks specified at
* <tt>excludeRacks</tt>.
*/
LOG.warn("Failed to choose a bookie node from network location {}, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you think here it is actually an INFO?
What if we have 5 racks. If we failed choosing from same rack (say rack 1) and we can't choose from rack 2 and 3 since they already have other copies, we can still choose from rack 4, and it's ok, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it still can choose from rack 4. But if we only have 3 racks (rack1, rack2, and rack3), and we failed to choose any bookie nodes from rack1, it will randomly choose one from rack2 or rack3, which will lead to 3 replicas located on 2 racks.

IMO, we'd better use the WARN level, because it will choose one bookie node from the whole bookie cluster randomly and it has the risk of the chosen bookie being located on the same rack with existing replicas.

Copy link
Contributor

@StevenLuMT StevenLuMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zymap zymap merged commit 9ff2954 into apache:master Feb 22, 2023
zymap pushed a commit that referenced this pull request Feb 22, 2023
### Motivation
We have 3 bookies in the same rack, and configured `E = 2, W = 2, A = 2`. When one bookie restarted, we found the ledger select bookie from the same rack failed in the ensemble change replacing the failed bookie step.

Due to there being no log information in `selectRandomFromRack`, it's hard to debug the root cause.
https://github.com/apache/bookkeeper/blob/02e64a4b97e03afc9993ab227f82a5956965c03f/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L630-L679

### Modification
Add one warn log in `selectRandomFromRack` when selecting a new bookie node failed.

(cherry picked from commit 9ff2954)
hangc0276 added a commit to hangc0276/bookkeeper that referenced this pull request Jun 26, 2023
### Motivation
We have 3 bookies in the same rack, and configured `E = 2, W = 2, A = 2`. When one bookie restarted, we found the ledger select bookie from the same rack failed in the ensemble change replacing the failed bookie step.

Due to there being no log information in `selectRandomFromRack`, it's hard to debug the root cause.
https://github.com/apache/bookkeeper/blob/02e64a4b97e03afc9993ab227f82a5956965c03f/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L630-L679

### Modification
Add one warn log in `selectRandomFromRack` when selecting a new bookie node failed.

(cherry picked from commit 9ff2954)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants