Feature: auto recover support repaired not adhering placement ledger #3359
For this case, we can already detect ledgers whose ensemble does not adhere to the placement policy. bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java Line 1378 in 677ccec
However, it only records this in the stats; it does not recover data to make the ensemble adhere to the placement policy. So we can add a config to repair such ledgers.
How to use it? To repair ledgers whose ensemble does not adhere to the placement policy, two parameters must be configured.
In Auditor In ReplicationWorker Attention
ping @merlimat @eolivelli @dlg99 @zymap @reddycharan Please help take a look at this PR, thanks.
@@ -402,6 +402,7 @@ public void registerLedgerMetadataListener(long ledgerId, LedgerMetadataListener
}
}
synchronized (listenerSet) {
listenerSet = listeners.computeIfAbsent(ledgerId, k -> new HashSet<>());
Did you bring the previous change into this PR?
Yes, I will remove it in this PR.
bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java
Map<String, List<BookieNode>> toPlaceGroup = new HashMap<>();
for (BookieId bookieId : ensemble) {
// If the bookie has shut down, put it into the inactive group.
BookieNode bookieNode = clone.get(bookieId);
If the bookie shuts down, it will be removed from knownBookies immediately. It belongs to the DATA_LOSS type.
In theory, if the fragment is DATA_LOSS, this method won't be invoked, because the data loss is repaired first.
But the bookie may shut down after the ledger check, so here we replace the shutdown bookie first.
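The idea in this thread (treat ensemble members that are no longer known as candidates for replacement) can be sketched as follows. This is only an illustration under stated assumptions, not the PR's code: `ShutdownBookieSketch`, `findShutdownBookies`, and the use of plain `String` ids and a `Map` of bookie id to rack are hypothetical stand-ins for `BookieId`, `BookieNode`, and the policy's known-bookies view.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of BookKeeper.
final class ShutdownBookieSketch {

    // Returns the ensemble members that are missing from the known-bookies view.
    // Such bookies are assumed to have shut down after the ledger check, so the
    // caller can add them to the excluded set and have them replaced.
    static Set<String> findShutdownBookies(List<String> ensemble,
                                           Map<String, String> knownBookieRacks) {
        Set<String> shutdown = new HashSet<>();
        for (String bookieId : ensemble) {
            if (!knownBookieRacks.containsKey(bookieId)) {
                shutdown.add(bookieId);
            }
        }
        return shutdown;
    }
}
```

Collecting the missing bookies up front keeps them from being evaluated with a fallback rack, which is the situation the author describes later in the conversation.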
...-server/src/main/java/org/apache/bookkeeper/client/TopologyAwareEnsemblePlacementPolicy.java
return bn;
}
}
throw new BKNotEnoughBookiesException();
Can we log something if we reach this point?
It is already logged: doReplaceToAdherePlacementPolicy catches BKNotEnoughBookiesException and logs it.
}
}

private int differBetweenBookies(List<BookieId> bookiesA, List<BookieId> bookiesB) {
Nit: static?
agree
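For context on the nit, a helper like this uses no instance state, so it can be declared `static`. The body below is a minimal sketch and an assumption, since the diff only shows the signature: it counts the positions at which two ensembles differ, uses `String` in place of `BookieId` to stay self-contained, and returns `Integer.MAX_VALUE` for null or mismatched lists as an illustrative choice.

```java
import java.util.List;

// Hypothetical stand-alone version of the helper discussed above.
final class DifferSketch {

    // Counts the positions at which the two ensembles hold different bookies.
    // Null or differently sized lists are treated as "maximally different"
    // (an assumption made for this sketch).
    static int differBetweenBookies(List<String> bookiesA, List<String> bookiesB) {
        if (bookiesA == null || bookiesB == null || bookiesA.size() != bookiesB.size()) {
            return Integer.MAX_VALUE;
        }
        int differ = 0;
        for (int i = 0; i < bookiesA.size(); i++) {
            if (!bookiesA.get(i).equals(bookiesB.get(i))) {
                differ++;
            }
        }
        return differ;
    }
}
```

A count like this lets the caller prefer the candidate ensemble that moves the fewest bookies, which keeps the amount of data to re-replicate small.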
int ackQuorumSize,
Set<BookieId> excludeBookies,
List<BookieId> currentEnsemble) {
throw new UnsupportedOperationException();
What happens if you don't override this method?
Are we handling this exception in the code that calls this method?
My understanding is that we cannot provide a good default implementation.
In the calling code we could catch this exception, log something, and gracefully abort the operation.
Yes, that's fine.
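The pattern agreed on here (a default method that throws, with the caller catching the exception, logging, and aborting) could look roughly like the sketch below. All names are hypothetical: `PlacementPolicySketch`, `repairEnsemble`, and `RepairCallerSketch` are illustrative and are not the interface or method names used in the PR.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical interface standing in for a placement policy.
interface PlacementPolicySketch {

    // Default implementation: existing policies keep compiling and simply
    // report that the repair operation is not supported.
    default List<String> repairEnsemble(List<String> currentEnsemble) {
        throw new UnsupportedOperationException();
    }
}

// Hypothetical caller, e.g. the code path that performs the repair.
final class RepairCallerSketch {

    // Returns the repaired ensemble, or empty if the policy cannot repair.
    static Optional<List<String>> tryRepair(PlacementPolicySketch policy,
                                            List<String> ensemble) {
        try {
            return Optional.of(policy.repairEnsemble(ensemble));
        } catch (UnsupportedOperationException e) {
            // Log and abort gracefully instead of failing the worker.
            System.err.println("Placement policy does not support repair; skipping ledger");
            return Optional.empty();
        }
    }
}
```

With this shape, a custom policy that does not override the method leaves the ledger untouched rather than crashing the caller.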
@eolivelli ping, I have addressed the comments. Could you review it again?
bookkeeper-server/src/main/java/org/apache/bookkeeper/client/EnsemblePlacementPolicy.java
bookkeeper-server/src/main/java/org/apache/bookkeeper/client/EnsemblePlacementPolicy.java
...-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
...-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java
bookkeeper-server/src/main/java/org/apache/bookkeeper/net/NetworkTopology.java
And what about the If it is being addressed, please let me know where it is. |
I mean that if we didn't handle the shutdown bookies, they would be handled as default bookies. The default rack is different from the other bookies' racks, so they would not be replaced. Now we add the shutdown bookies to the excluded nodes so that they are replaced.
Just to confirm, is this what you mentioned here?
Yes.
Thank you for your explanation.
LGTM.
Fix old workflow; please see #3455 for details.
rerun failure checks
This PR is an enhancement for auto recovery, and the new interface has a default implementation, which is compatible with the old version. I suggest cherry-picking it to branch-4.14 and branch-4.15. Do you have any suggestions? @merlimat @eolivelli @dlg99 @rdhabalia @zymap
@hangc0276 please ask on dev@
…pache#3359) (cherry picked from commit fc981ba)
Yes, we should cherry-pick it.
…pache#3359) (cherry picked from commit fc981ba)
Descriptions of the changes in this PR:
There is a use case.
We should support a feature to cover this case.