Skip to content

Issue#1886 Handle double bookie failures#1887

Merged
eolivelli merged 5 commits intoapache:branch-4.8from
jvrao:metadata-final
Dec 20, 2018
Merged

Issue#1886 Handle double bookie failures#1887
eolivelli merged 5 commits intoapache:branch-4.8from
jvrao:metadata-final

Conversation

@jvrao
Copy link
Copy Markdown
Contributor

@jvrao jvrao commented Dec 14, 2018

For this race condition to happen:

  1. ZK metadata version is different from Client write ledger
    handle - Replication worker
  2. Client made an ensemble change and replaced a bookie,
    sent change proposal to zk
  3. While this is pending, Client made another ensemble change
    replaced the same index with the bookie that was prior to step#2
  4. Ensemble change made in step#2 came back to client with
    version conflict error.
  5. Client need to resolve this conflict

The fix is to reconsile the local state, zk state, and do bookie
replaceemnt, send the updated metadata to zk.

Signed-off-by: Venkateswararao Jujjuri (JV) vjujjuri@salesforce.com
(@Rev Sam Just@)

Descriptions of the changes in this PR:

Motivation

(Explain: why you're making that change, what is the problem you're trying to solve)

Changes

(Describe: what changes you have made)

Master Issue: #


In order to uphold a high standard for quality for code contributions, Apache BookKeeper runs various precommit
checks for pull requests. A pull request can only be merged when it passes precommit checks. However running all
the precommit checks can take a long time, some trivial changes don't need to run all the precommit checks. You
can check following list to skip the tests that don't need to run for your pull request. Leave them unchecked if
you are not sure, committers will help you:

  • [skip bookkeeper-server bookie tests]: skip testing org.apache.bookkeeper.bookie in bookkeeper-server module.
  • [skip bookkeeper-server client tests]: skip testing org.apache.bookkeeper.client in bookkeeper-server module.
  • [skip bookkeeper-server replication tests]: skip testing org.apache.bookkeeper.replication in bookkeeper-server module.
  • [skip bookkeeper-server tls tests]: skip testing org.apache.bookkeeper.tls in bookkeeper-server module.
  • [skip bookkeeper-server remaining tests]: skip testing all other tests in bookkeeper-server module.
  • [skip integration tests]: skip docker based integration tests. if you make java code changes, you shouldn't skip integration tests.
  • [skip build java8]: skip build on java8. ONLY skip this when ONLY changing files under documentation under site.
  • [skip build java9]: skip build on java9. ONLY skip this when ONLY changing files under documentation under site.


Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

If this PR is a BookKeeper Proposal (BP):

  • Make sure the PR title is formatted like:
    <BP-#>: Description of bookkeeper proposal
    e.g. BP-1: 64 bits ledger is support
  • Attach the master issue link in the description of this PR.
  • Attach the google doc link if the BP is written in Google Doc.

Otherwise:

  • Make sure the PR title is formatted like:
    <Issue #>: Description of pull request
    e.g. Issue 123: Description ...
  • Make sure tests pass via mvn clean apache-rat:check install spotbugs:check.
  • Replace <Issue #> in the title with the actual Issue number.

For this race condition to happen:
1. ZK metadata version is different from Client write ledger
   handle - Replication worker
2. Client made an ensemble change and replaced a bookie,
   sent change proposal to zk
3. While this is pending, Client made another ensemble change
   replaced the same index with the bookie that was prior to step#2
4. Ensemble change made in step#2 came back to client with
   version conflict error.
5. Client need to resolve this conflict

The fix is to reconsile the local state, zk state, and do bookie
replaceemnt, send the updated metadata to zk.

Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>
(@Rev Sam Just@)
// No test inserts by default
return false;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not like the stub function to inject mock object, but I ran into various complications with mocking inner class methods which are generated on demand. If there is any better way I am more than happy to adopt it. Given that this is not applicable for master I gave pull request to 4.8 only. Not sure if we have a test case in the master to cover this case. @ivankelly ?

ensembleInfo = replaceBookieInMetadata(ensembleInfo.failedBookies, numEnsembleChanges.get());
} catch (BKException.BKNotEnoughBookiesException e) {
LOG.error("Could not get additional bookie to remake ensemble, closing ledger: {}", ledgerId);
handleUnrecoverableErrorDuringAdd(e.getCode());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we return false here ReReadLedgerMetadataCb will already handle calling handleUnrecoverableErrorDuringAdd, so doing it here results in it being called twice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true; I will remove the handleUnrecoverableErrorDuringAdd() from here. There are other places in this method returns false but won't call handleUnrecoverableErrorDuringAdd.

@athanatos
Copy link
Copy Markdown

I don't really have a problem with the test stub, tbh.

injectFailedBookies.put(0, newBkAddr);
when(mlh.testStubResolveConflict()).thenAnswer(
invoke -> {
LOG.info("JV inside testInsertEnsembleChange");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"JV inside"!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL; you don't like it ?

*/
private boolean resolveConflict(LedgerMetadata newMeta) {
LedgerMetadata metadata = getLedgerMetadata();
testStubResolveConflict();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. i dont get how is this method impacting anyway since you are not using this method return value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for the test stub to mock and interject errors at this point. This is LedgerHandle level visibility, hence easy to mock.

@jvrao
Copy link
Copy Markdown
Contributor Author

jvrao commented Dec 17, 2018

rebuild java8

@jvrao
Copy link
Copy Markdown
Contributor Author

jvrao commented Dec 18, 2018

@ivankelly @sijie @eolivelli can one of you review this please?

@sijie
Copy link
Copy Markdown
Member

sijie commented Dec 19, 2018

run integration tests

@jvrao
Copy link
Copy Markdown
Contributor Author

jvrao commented Dec 19, 2018

+ .test-infra/scripts/pre-docker-tests.sh
/tmp/jenkins2319925842930863686.sh: line 2: .test-infra/scripts/pre-docker-tests.sh: No such file or directory
Build step 'Execute shell' marked build as failure```
@sijie 

@sijie
Copy link
Copy Markdown
Member

sijie commented Dec 19, 2018

@jvrao oh I see. it seems the integration tests job was changed in master. I think we can ignore the integration job for merging this.

@sijie
Copy link
Copy Markdown
Member

sijie commented Dec 19, 2018

IGNORE IT CI

@jvrao
Copy link
Copy Markdown
Contributor Author

jvrao commented Dec 20, 2018

Cool; shall we merge this then? @sijie

Copy link
Copy Markdown
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Merging it now

@eolivelli
Copy link
Copy Markdown
Contributor

IGNORE CI

@eolivelli
Copy link
Copy Markdown
Contributor

@jvrao I have merged this patch. Thanks.

If you want you can self-merge if the patch has been approved by any other committer, just by using dev/bk-merge-pr.py from the command line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants