
Deferred failure handling can cause data loss #1591

Closed
ivankelly opened this issue Aug 9, 2018 · 15 comments

Comments

@ivankelly
Contributor

The bookkeeper client has a feature where, if a ledger has a write quorum (Qw) larger than its ack quorum (Qa), such as 3:3:2, and up to (Qw - Qa) bookies return an error after the entry add has already completed, the failing bookies are replaced in the ensemble in the background.

This can cause data loss.
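(For reference, this is the kind of configuration that enables the deferred path: a minimal sketch of creating such a ledger with the client API. Exception handling is omitted, and the zookeeper address and password are placeholders.)

    BookKeeper bkc = new BookKeeper("zk1:2181");
    // ensemble size 3, write quorum 3, ack quorum 2 -- a 3:3:2 ledger
    LedgerHandle lh = bkc.createLedger(3, 3, 2,
            BookKeeper.DigestType.CRC32, "password".getBytes());
    // each add completes once 2 of the 3 bookies respond; an error from the
    // remaining bookie lands on the delayed ensemble change list
    lh.addEntry("entry-0".getBytes());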

Consider a 3:3:2 ledger. Assume zookeeper is not accepting writes.

Start ensemble is b1,b2,b3

  • e0 is written to b1,b2,b3
  • b2 & b3 acknowledge, e0 request completes
  • b1 responds with an error and is added to the delayed ensemble change list
  • e1 is written; delayed error handling kicks in and b1 is replaced, so e1 is written to b4,b2,b3 while the client tries to update the metadata in zookeeper. zookeeper is stalled, so it never responds.
  • repeat this sequence for each of b2 and b3.

As each bookie fails, it will be replaced and writes will be acknowledged. Eventually the ensembles will look something like

0: b1,b2,b3
1: b4,b2,b3
3: b4,b5,b3
4: b4,b5,b6

However, this is only local; zookeeper still only has the initial ensemble. So even though all entries from 4 onwards are acknowledged successfully to the client, another client reading the ledger will not see them. If that other client recovers the ledger, the data is lost. A TOAB violation.

Here's a test case which triggers the issue:
https://github.com/ivankelly/bookkeeper/blob/15bc251d46d5cd5fcceef130c0046eeacbe446cc/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/TestDeferredFailure.java

@sijie
Member

sijie commented Aug 9, 2018

@jvrao can you take a look at this issue? this is related to #1395 .

@ivankelly
Contributor Author

@sijie @jvrao the core issue, as I see it, is that we are allowing adds to complete while the ensemble change is in progress. While this is nice from a latency point of view, it breaks our guarantees. A quick fix would be to block completions the same way we do for normal error recovery.

@ivankelly
Contributor Author

Also, we should only allow a single ensemble change at a time per ledger. The case where you need more than one is rare enough that people can live with the latency hit. It would solve the scenario in the test case posted, though it wouldn't solve the general case (2:2:1 will still have this type of failure).

@jvrao
Contributor

jvrao commented Aug 9, 2018 via email

@sijie
Member

sijie commented Aug 9, 2018

A quick fix would be to block completions the same way we do for normal error recovery.

I think a quick fix is that handleDelayedWriteBookieFailure should increment blockAddCompletions to block completions.
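In sketch form the suggestion looks roughly like this inside LedgerHandle (illustrative only, not the actual patch; startEnsembleChange is a made-up helper, blockAddCompletions is the counter named above):

    // Illustrative sketch, not the actual patch: take the completion barrier
    // before starting the delayed ensemble change, and release it only once
    // the new ensemble is durable in the metadata store.
    void handleDelayedWriteBookieFailure() {
        blockAddCompletions.incrementAndGet();             // stop completing adds
        startEnsembleChange(delayedWriteFailedBookies,     // made-up helper
            rc -> blockAddCompletions.decrementAndGet());  // resume once zookeeper has the change
    }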

@jvrao
Contributor

jvrao commented Aug 9, 2018

I will send a patch asap.

@ivankelly
Contributor Author

@sijie exactly, the blockAddCompletions stuff should fix this.
If handleDelayedWriteBookieFailure does the blocking, then there's very little to differentiate it from handleBookieFailures, so the logic can be pretty much the same.

@ivankelly
Contributor Author

I think we should restrict one outstanding ensemble change in all cases.

@jvrao just to re-emphasize: this restriction, while simplifying things, won't solve the problem if adds can complete while the change is happening.

The 2:2:1 case I was talking about is:

  • Two bookies, b1, b2
  • Write to b1 & b2
  • b1 acks successfully, b2 hangs; as the ack quorum is 1, the write is acknowledged
  • the ensemble change replaces b2 with b3, but stalls writing to zk
  • b1 hangs
  • a bunch of entries are written; as b3 is in the local ensemble and the ack quorum is 1, these entries are all acknowledged
  • Client crashes before ever being able to update the zk metadata.

So, we should absolutely make it so that only a single ensemble change occurs at a time (it should be easier now with CompletableFuture, which wasn't available the last time this had a major revisit).

But the fix should be blocking add completions while an ensemble change is in progress. If you haven't started on a patch by the time I start tomorrow, I'll submit something.
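A rough shape for the single-outstanding-change part with CompletableFuture (illustrative only; replaceBookiesAndWriteMetadata is a made-up method):

    // Illustrative sketch: chain every requested change on the previous one,
    // so at most one metadata update per ledger is in flight at a time.
    private CompletableFuture<Void> lastEnsembleChange =
            CompletableFuture.completedFuture(null);

    synchronized CompletableFuture<Void> requestEnsembleChange(Set<BookieSocketAddress> failedBookies) {
        lastEnsembleChange = lastEnsembleChange
            .thenCompose(prev -> replaceBookiesAndWriteMetadata(failedBookies)); // made-up method
        return lastEnsembleChange;
    }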

@jvrao
Contributor

jvrao commented Aug 9, 2018

For now the fix is to call handleBookieFailure()

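    // Routing delayed write failures through the regular bookie-failure path
    // means add completions are blocked until the ensemble change completes.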
    void handleDelayedWriteBookieFailure() {
        handleBookieFailure(delayedWriteFailedBookies);
    }

Until @ivankelly's immutable local metadata comes in. The original intent of this change was to do a best-effort ensemble change, but since we are mutating local metadata we can't ignore failures. So I will make this change and check it in. But I strongly second the proposal of only one outstanding metadata change at any time, for any of our changes: hold the writer until we hear back. @ivankelly you can submit that change and revert this back.

@ivankelly
Contributor Author

For completeness, the same scenario with immutable metadata would be:

Two-bookie ensemble, b1,b2; the ledger is 2:2:1

  • Write e0 to b1 (success), b2 (fail after write complete)
  • b2 is replaced, making ensemble 1: b1, b3; the metadata write is sent to zookeeper
  • Write e1 to b1 (fail for some reason), b2 (success for some reason)
  • Write to zookeeper completes

So e1 is only on b2, and since the ensemble for entry 1 is b1, b3, it can never be read back by a client. Again, the source of the issue is allowing add acknowledgement while the ensemble is changing.

@ivankelly
Contributor Author

@jvrao

The original intent of this change is to do a best-effort ensemble change

With immutable metadata this should work quite well. There are two cases: a) more bookies are available, b) no more bookies are available.

With a) we block completions while replacing.
With b) we don't modify anything locally because there's nothing to replace with; completions can continue on the old quorum without issue (though this needs to be taken into account when we make the change: we shouldn't block completions until we know there are bookies available for the replacement).

@ivankelly
Contributor Author

@jvrao fix looks good btw 👍

@jvrao
Contributor

jvrao commented Aug 9, 2018

a) we block completions while replacing

How does that protect against the scenario you mentioned, "Write e1 to b1 (fail for some reason), b2 (success for some reason)", unless you block writes while the proposed change to the metadata is outstanding? In the current code the local metadata is already changed, hence e1 would have gone to b1,b3, so blocking on the completion is good enough. With local metadata becoming immutable we need to block on the write.

ivankelly pushed a commit that referenced this issue Aug 13, 2018
Descriptions of the changes in this PR:

The original intent of this change is to do a best-effort ensemble change.
But this is not possible until the local metadata is completely immutable.
Until the feature "Make LedgerMetadata Immutable #610" is complete we will use
handleBookieFailure() to handle delayed writes as regular bookie failures.

Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>

Master Issue: #1591
Related Issue: #1395

Author: JV Jujjuri <vjujjuri@salesforce.com>
Author: Ivan Kelly <ivank@apache.org>

Reviewers: Ivan Kelly <ivank@apache.org>, Sijie Guo <sijie@apache.org>

This closes #1592 from jvrao/datalossbug
ivankelly pushed a commit that referenced this issue Aug 13, 2018

(cherry picked from commit 3ab6e92)
Signed-off-by: Ivan Kelly <ivank@apache.org>
@athanatos

@jvrao I think the idea is that once you make an ensemble change beginning at entry e, you must defer responding to the client on any entry e' >= e until the metadata update is durable in zookeeper.
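In sketch form (illustrative only; the names here are made up, not client code):

    // Sketch of the rule: once an ensemble change starts at entry e, hold the
    // callback for every entry >= e until the zookeeper metadata write has
    // succeeded, then release the deferred callbacks in entry order.
    long changeStartEntry = -1;                              // -1 == no change in flight
    NavigableMap<Long, LongConsumer> deferredAcks = new TreeMap<>();

    void onEntryReachedAckQuorum(long entryId, LongConsumer ackClient) {
        if (changeStartEntry >= 0 && entryId >= changeStartEntry) {
            deferredAcks.put(entryId, ackClient);            // change not yet durable, defer
        } else {
            ackClient.accept(entryId);                       // safe to acknowledge now
        }
    }

    void onEnsembleChangeDurable() {
        changeStartEntry = -1;
        deferredAcks.forEach((id, ack) -> ack.accept(id));   // flush in order
        deferredAcks.clear();
    }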

reddycharan pushed a commit to reddycharan/bookkeeper that referenced this issue Aug 20, 2018
* Avoid releasing sent buffer too early in BookieClient mock

If the buffer was sent to more than one bookie with the mock, it would
be released after being sent to the first one. Each write should
retain a refCount itself, and then release when done.

Author: Ivan Kelly <ivank@apache.org>

Reviewers: Sijie Guo <sijie@apache.org>

This closes apache#1598 from ivankelly/double-rel-mock

* (@bug W-5344681@) Delayed write ensemble change may cause dataloss

This closes apache#1592 from jvrao/datalossbug
@sijie
Member

sijie commented Aug 21, 2018

This is fixed by #1592

sijie closed this as completed Aug 21, 2018