Issue #1886: Handle double bookie failures (#1887)
For this race condition to happen:

1. The ZK metadata version is different from the one in the client's write ledger handle (replication worker).
2. The client makes an ensemble change that replaces a bookie and sends the change proposal to ZK.
3. While this proposal is pending, the client makes another ensemble change that replaces the same index with the bookie that was there prior to step 2.
4. The ensemble change made in step 2 comes back to the client with a version conflict error.
5. The client needs to resolve this conflict.

The fix is to reconcile the local state with the ZK state, do the bookie replacement, and send the updated metadata to ZK.

Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>
(@Rev Sam Just@)
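The reconcile step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this patch: the class `EnsembleReconcileSketch`, the method names, the use of plain strings for bookie addresses, and the `IllegalStateException` (standing in for `BKNotEnoughBookiesException`) are all hypothetical. The idea shown is to start from the ensemble read back from ZK and replace only the slots that still hold a bookie the writer has seen fail.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and types are assumptions, not BookKeeper APIs.
final class EnsembleReconcileSketch {

    /**
     * Builds the ensemble to propose after a version conflict.
     *
     * @param storedEnsemble ensemble read back from the metadata store
     * @param failedBookies  ensemble index -> bookie the writer saw fail there
     * @param spareBookies   candidate replacement bookies, consumed in order
     */
    static List<String> reconcile(List<String> storedEnsemble,
                                  Map<Integer, String> failedBookies,
                                  Deque<String> spareBookies) {
        List<String> result = new ArrayList<>(storedEnsemble);
        for (Map.Entry<Integer, String> entry : failedBookies.entrySet()) {
            int index = entry.getKey();
            if (!result.get(index).equals(entry.getValue())) {
                // The stored slot no longer holds the failed bookie,
                // so another change already replaced it; keep the stored value.
                continue;
            }
            result.set(index, nextSpare(result, failedBookies, spareBookies));
        }
        return result;
    }

    // Picks the first spare that is neither in the ensemble nor known to have failed.
    private static String nextSpare(List<String> ensemble,
                                    Map<Integer, String> failedBookies,
                                    Deque<String> spareBookies) {
        String candidate;
        while ((candidate = spareBookies.poll()) != null) {
            if (!ensemble.contains(candidate) && !failedBookies.containsValue(candidate)) {
                return candidate;
            }
        }
        // Stand-in for the "not enough bookies" error path.
        throw new IllegalStateException("not enough bookies to remake ensemble");
    }
}
```

In this sketch the reconciled list would then be written back to the metadata store against the version that was just read, so a further concurrent change would surface as another conflict instead of being overwritten.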
    // No test inserts by default
    return false;
    }
I did not like using a stub function to inject a mock object, but I ran into various complications with mocking inner-class methods that are generated on demand. If there is a better way, I am more than happy to adopt it. Given that this is not applicable to master, I opened this pull request against 4.8 only. I am not sure whether we have a test case in master that covers this case. @ivankelly?
    ensembleInfo = replaceBookieInMetadata(ensembleInfo.failedBookies, numEnsembleChanges.get());
    } catch (BKException.BKNotEnoughBookiesException e) {
    LOG.error("Could not get additional bookie to remake ensemble, closing ledger: {}", ledgerId);
    handleUnrecoverableErrorDuringAdd(e.getCode());
I think if we return false here ReReadLedgerMetadataCb will already handle calling handleUnrecoverableErrorDuringAdd, so doing it here results in it being called twice.
True; I will remove the handleUnrecoverableErrorDuringAdd() call from here. There are other places in this method that return false without calling handleUnrecoverableErrorDuringAdd.
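The point of this exchange, that the failure handler should be invoked by the caller only, can be sketched as below. This is a simplified illustration with hypothetical names (`SingleErrorPathSketch`, `onMetadataReRead`, the boolean `enoughBookies` flag, the `-1` error code); it does not reproduce the real `ReReadLedgerMetadataCb` logic.

```java
// Hypothetical sketch; it only illustrates who owns the error path.
final class SingleErrorPathSketch {
    // Counts how often the unrecoverable-error handler ran.
    int unrecoverableCalls = 0;

    void handleUnrecoverableError(int rc) {
        unrecoverableCalls++;
    }

    // Reports failure through the return value only; it never calls the handler itself.
    boolean resolveConflict(boolean enoughBookies) {
        if (!enoughBookies) {
            return false;
        }
        return true;
    }

    // Stand-in for the callback that re-reads metadata: the single place
    // that turns a failed resolution into an unrecoverable error.
    void onMetadataReRead(boolean enoughBookies) {
        if (!resolveConflict(enoughBookies)) {
            handleUnrecoverableError(-1); // -1 is an arbitrary placeholder code
        }
    }
}
```

With this split, every `return false` inside the resolving method behaves the same way, and the handler runs exactly once per failed resolution.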
I don't really have a problem with the test stub, tbh.
    injectFailedBookies.put(0, newBkAddr);
    when(mlh.testStubResolveConflict()).thenAnswer(
    invoke -> {
    LOG.info("JV inside testInsertEnsembleChange");
LOL; you don't like it ?
    */
    private boolean resolveConflict(LedgerMetadata newMeta) {
    LedgerMetadata metadata = getLedgerMetadata();
    testStubResolveConflict();
Hmm, I don't get how this method has any impact, since you are not using its return value.
This is for the test stub to mock and inject errors at this point. It has LedgerHandle-level visibility, hence it is easy to mock.
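The hook pattern described here can be sketched without a mocking library. The patch's test uses Mockito (`when(...).thenAnswer(...)`); the version below swaps that for a plain subclass override so it stays self-contained. All class names and the event strings are hypothetical; only the hook name `testStubResolveConflict` comes from the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a test hook; not the actual LedgerHandle code.
class HandleWithHookSketch {
    // Records the order of steps so a test can observe where the hook ran.
    final List<String> events = new ArrayList<>();

    // No test inserts by default. The caller ignores the return value;
    // the method exists only so a test can run code at this exact point.
    boolean testStubResolveConflict() {
        return false;
    }

    boolean resolveConflict() {
        events.add("read-local-metadata");
        testStubResolveConflict(); // injection point between read and reconcile
        events.add("reconcile");
        return true;
    }
}

// Test-side subclass that injects a second failure at the hook.
class InjectingHandleSketch extends HandleWithHookSketch {
    @Override
    boolean testStubResolveConflict() {
        events.add("injected-bookie-failure");
        return true;
    }
}
```

Because the hook is package-visible and non-private, a test in the same package can override or mock it, which is what makes it usable as an injection point even though its result is discarded.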

rebuild java8

@ivankelly @sijie @eolivelli can one of you review this please?

run integration tests

@jvrao oh I see. it seems the integration tests job was changed in master. I think we can ignore the integration job for merging this.

IGNORE IT CI

Cool; shall we merge this then? @sijie

IGNORE CI

@jvrao I have merged this patch. Thanks. If you want you can self-merge if the patch has been approved by any other committer, just by using dev/bk-merge-pr.py from the command line