
Deferred failure handling can cause data loss #1591

Closed
ivankelly opened this issue Aug 9, 2018 · 15 comments

Comments

@ivankelly
Contributor

The bookkeeper client has a feature where, if a ledger has a write quorum (Qw) larger than its ack quorum (Qa), such as 3:3:2, and up to (Qw - Qa) bookies return an error after the entry add has already completed, the failing bookies are replaced in the ensemble in the background.

This can cause data loss.
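(For reference, this is the kind of configuration that enables the deferred path: a minimal sketch of creating such a ledger with the client API. Exception handling is omitted, and the zookeeper address and password are placeholders.)

    BookKeeper bkc = new BookKeeper("zk1:2181");
    // ensemble size 3, write quorum 3, ack quorum 2 -- a 3:3:2 ledger
    LedgerHandle lh = bkc.createLedger(3, 3, 2,
            BookKeeper.DigestType.CRC32, "password".getBytes());
    // each add completes once 2 of the 3 bookies respond; an error from the
    // remaining bookie lands on the delayed ensemble change list
    lh.addEntry("entry-0".getBytes());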

Consider a 3:3:2 ledger. Assume zookeeper is not accepting writes.

Start ensemble is b1,b2,b3

  • e0 is written to b1,b2,b3
  • b2 & b3 acknowledge, e0 request completes
  • b1 responds with an error and is added to the delayed ensemble change list
  • e1 is written; delayed error handling kicks in and b1 is replaced, so e1 is written to b4,b2,b3 while the client tries to update the metadata in zookeeper. zookeeper is stalled, so it never responds.
  • repeat this sequence for each of b2 and b3.

As each bookie fails, it will be replaced and writes will be acknowledged. Eventually the ensembles will look something like

0: b1,b2,b3
1: b4,b2,b3
3: b4,b5,b3
4: b4,b5,b6

However, this is only local; zookeeper still only has the initial ensemble. So even though all entries from 4 onwards are acknowledged successfully to the client, another client reading the ledger will not see them. If that other client recovers the ledger, the data is lost. A TOAB violation.

Here's a test case which triggers the issue:
https://github.com/ivankelly/bookkeeper/blob/15bc251d46d5cd5fcceef130c0046eeacbe446cc/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/TestDeferredFailure.java

@sijie
Member

sijie commented Aug 9, 2018

@jvrao can you take a look at this issue? this is related to #1395 .

@ivankelly
Contributor Author

@sijie @jvrao the core issue, as I see it, is that we are allowing adds to complete while the ensemble change is in progress. While this is nice from a latency point of view, it breaks our guarantees. A quick fix would be to block completions the same way we do for normal error recovery.

@ivankelly
Contributor Author

Also, we should only allow a single ensemble change at a time per ledger. The case where you need more than one is rare enough that people can live with the latency hit. It would solve the scenario in the test case posted, though it wouldn't solve the general case (2:2:1 will still have this type of failure).

@jvrao
Contributor

jvrao commented Aug 9, 2018 via email

@sijie
Member

sijie commented Aug 9, 2018

A quick fix would be to block completions the same way we do for normal error recovery.

I think a quick fix is that handleDelayedWriteBookieFailure should increment blockAddCompletions to block completions.
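In sketch form the suggestion looks roughly like this inside LedgerHandle (illustrative only, not the actual patch; startEnsembleChange is a made-up helper, blockAddCompletions is the counter named above):

    // Illustrative sketch, not the actual patch: take the completion barrier
    // before starting the delayed ensemble change, and release it only once
    // the new ensemble is durable in the metadata store.
    void handleDelayedWriteBookieFailure() {
        blockAddCompletions.incrementAndGet();             // stop completing adds
        startEnsembleChange(delayedWriteFailedBookies,     // made-up helper
            rc -> blockAddCompletions.decrementAndGet());  // resume once zookeeper has the change
    }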

@jvrao
Contributor

jvrao commented Aug 9, 2018

I will send a patch asap.

@ivankelly
Contributor Author

@sijie exactly, the blockAddCompletions stuff should fix this.
If handleDelayedWriteBookieFailure does the blocking, then there's very little to differentiate it from handleBookieFailures, so the logic can be pretty much the same.

@ivankelly
Contributor Author

I think we should restrict one outstanding ensemble change in all cases.

@jvrao just to re-emphasize: this restriction, while simplifying things, won't solve the problem if adds can complete while the change is happening.

The 2:2:1 case I was talking about is:

  • Two bookies, b1, b2
  • Write to b1 & b2
  • b1 acks successfully, b2 hangs; as the ack quorum is 1, the write is acknowledged
  • the ensemble change replaces b2 with b3, but stalls writing to zk
  • b1 hangs
  • a bunch of entries are written; as b3 is in the local ensemble and the ack quorum is 1, these entries are all acknowledged
  • Client crashes before ever being able to update the zk metadata.

So, we should absolutely make it so that only a single ensemble change occurs at a time (it should be easier now with CompletableFuture, which wasn't available the last time this had a major revisit).

But the fix should be blocking add completions while an ensemble change is in progress. If you haven't started on a patch by the time I start tomorrow, I'll submit something.
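A rough shape for the single-outstanding-change part with CompletableFuture (illustrative only; replaceBookiesAndWriteMetadata is a made-up method):

    // Illustrative sketch: chain every requested change on the previous one,
    // so at most one metadata update per ledger is in flight at a time.
    private CompletableFuture<Void> lastEnsembleChange =
            CompletableFuture.completedFuture(null);

    synchronized CompletableFuture<Void> requestEnsembleChange(Set<BookieSocketAddress> failedBookies) {
        lastEnsembleChange = lastEnsembleChange
            .thenCompose(prev -> replaceBookiesAndWriteMetadata(failedBookies)); // made-up method
        return lastEnsembleChange;
    }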

@jvrao
Contributor

jvrao commented Aug 9, 2018

For now the fix is to call handleBookieFailure()

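    // Routing delayed write failures through the regular bookie-failure path
    // means add completions are blocked until the ensemble change completes.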
    void handleDelayedWriteBookieFailure() {
        handleBookieFailure(delayedWriteFailedBookies);
    }

Until @ivankelly's immutable local metadata comes in. The original intent of this change was to do a best-effort ensemble change, but since we are mutating local metadata we can't ignore failures. So I will make this change and check it in. But I strongly second the proposal of only one outstanding metadata change at any time, for any of our changes: hold the writer until we hear back. @ivankelly you can submit that change and revert this back.

@ivankelly
Contributor Author

For completeness, the same scenario with immutable metadata would be:

Two-bookie ensemble, b1,b2; the ledger is 2:2:1

  • Write e0 to b1 (success), b2 (fail after write complete)
  • b2 is replaced, making ensemble 1: b1, b3; the metadata write is sent to zookeeper
  • Write e1 to b1 (fail for some reason), b2 (success for some reason)
  • Write to zookeeper completes

So e1 is only on b2, and since the ensemble for entry 1 is b1, b3, it can never be read back by a client. Again, the source of the issue is allowing add acknowledgement while the ensemble is changing.

@ivankelly
Contributor Author

@jvrao

The original intent of this change is to do a best-effort ensemble change

With immutable metadata this should work quite well. There are two cases: a) more bookies are available, b) no more bookies are available.

With a) we block completions while replacing.
With b) we don't modify anything locally because there's nothing to replace with; completions can continue on the old quorum without issue (though this needs to be taken into account when we make the change: we shouldn't block completions until we know there are bookies available for the replacement).

@ivankelly
Contributor Author

@jvrao fix looks good btw 👍

@jvrao
Contributor

jvrao commented Aug 9, 2018

a) we block completions while replacing

How does that protect against the scenario you mentioned, "Write e1 to b1 (fail for some reason), b2 (success for some reason)", unless you block writes while the proposed change to the metadata is outstanding? In the current code the local metadata is already changed, hence e1 would have gone to b1,b3, so blocking on the completion is good enough. With local metadata becoming immutable we need to block on the write.

ivankelly pushed a commit that referenced this issue Aug 13, 2018
Descriptions of the changes in this PR:

The original intent of this change is to do a best-effort ensemble change.
But this is not possible until the local metadata is completely immutable.
Until the feature "Make LedgerMetadata Immutable #610" is complete we will use
handleBookieFailure() to handle delayed writes as regular bookie failures.

Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>

Master Issue: #1591
Related Issue: #1395

Author: JV Jujjuri <vjujjuri@salesforce.com>
Author: Ivan Kelly <ivank@apache.org>

Reviewers: Ivan Kelly <ivank@apache.org>, Sijie Guo <sijie@apache.org>

This closes #1592 from jvrao/datalossbug
ivankelly pushed a commit that referenced this issue Aug 13, 2018

(cherry picked from commit 3ab6e92)
Signed-off-by: Ivan Kelly <ivank@apache.org>
@athanatos

@jvrao I think the idea is that once you make an ensemble change beginning at entry e, you must defer responding to the client on any entry e' >= e until the metadata update is durable in zookeeper.
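In sketch form (illustrative only; the names here are made up, not client code):

    // Sketch of the rule: once an ensemble change starts at entry e, hold the
    // callback for every entry >= e until the zookeeper metadata write has
    // succeeded, then release the deferred callbacks in entry order.
    long changeStartEntry = -1;                              // -1 == no change in flight
    NavigableMap<Long, LongConsumer> deferredAcks = new TreeMap<>();

    void onEntryReachedAckQuorum(long entryId, LongConsumer ackClient) {
        if (changeStartEntry >= 0 && entryId >= changeStartEntry) {
            deferredAcks.put(entryId, ackClient);            // change not yet durable, defer
        } else {
            ackClient.accept(entryId);                       // safe to acknowledge now
        }
    }

    void onEnsembleChangeDurable() {
        changeStartEntry = -1;
        deferredAcks.forEach((id, ack) -> ack.accept(id));   // flush in order
        deferredAcks.clear();
    }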

reddycharan pushed a commit to reddycharan/bookkeeper that referenced this issue Aug 20, 2018
* Avoid releasing sent buffer too early in BookieClient mock

If the buffer was sent to more than one bookie with the mock, it would
be released after being sent to the first one. Each write should
retain a refCount itself, and then release when done.

Author: Ivan Kelly <ivank@apache.org>

Reviewers: Sijie Guo <sijie@apache.org>

This closes apache#1598 from ivankelly/double-rel-mock

* (@bug W-5344681@) Delayed write ensemble change may cause dataloss

This closes apache#1592 from jvrao/datalossbug
@sijie
Member

sijie commented Aug 21, 2018

This is fixed by #1592

sijie closed this as completed Aug 21, 2018