Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot isolation violation with server-side sequencing #2391

Closed
aphyr opened this issue May 16, 2018 · 7 comments
Closed

Snapshot isolation violation with server-side sequencing #2391

aphyr opened this issue May 16, 2018 · 7 comments
Labels

Comments

@aphyr
Copy link

@aphyr aphyr commented May 16, 2018

On Dgraph 1.0.5-dev (7796a40), overlapping-ring network partitions may occasionally result in read skew, despite using both server-side sequencing and @upsert directives. This occurred on a five-node cluster, where all five nodes run both zero and alpha, and with replica count five. This problem is currently difficult to reproduce, but since it occurred in a single-group cluster where tablet moves are impossible, I believe it's a separate bug from #2321.

I have only two cases of this so far: 20180511T185232.000-0500.zip, and 20180516T095744.000-0500.zip
/20180516T095744.000-0500.zip).

The first is more dramatic: a single node, n1, has read transactions which observe up to 30% lower totals across all accounts for tens of seconds. 95 out of 1346 total reads violated snapshot isolation. The first invalid read begins six seconds after the resolution of a network partition, and before the next partition began; I suspect there may be some relationship with either a zero or alpha topology change, but without more data it's hard to infer more.

bank

The second case had n3 jump from 100 to 102 for a brief period.

bank

The good news is that unlike earlier SI violations, these appear to be limited to single nodes, and don't result in values drifting over time, so it may only affect read safety and not updates.

@aphyr
Copy link
Author

@aphyr aphyr commented May 16, 2018

Snagged a third case; also limited to a single node. Nothing obvious in the logs yet, though. Looks like n3 went wonky somehow. 20180516T135225.000-0500 (1).zip. There's a long period of "ignored a MsgAppResp message with lower term" messages on n3, but that also occurred on n2, so it might be unrelated.

bank 1

@manishrjain
Copy link
Member

@manishrjain manishrjain commented May 16, 2018

I have a hunch for what could be happening here. In particular, the lin read server-side sequencing should probably occur after waiting for Zero Transaction stream to catch up.

I'd like to try it out, but if this happens very rarely, it would be hard to test.

@manishrjain
Copy link
Member

@manishrjain manishrjain commented May 18, 2018

Running tests to see if this is still a bug.

Set --test-count 20 --time-limit 300 --replicas 5, dgraph unknown bank s=server upsert nemesis=partition-ring.

@aphyr
Copy link
Author

@aphyr aphyr commented May 23, 2018

We've added code to disable predicate moves entirely, and found that even in healthy clusters with no predicate moves, Dgraph can occasionally return incorrect balances, or nil values for predicates. This is on v1.0.5-dev, 7796a40: 20180522T232334.000-0500 (1).zip

bank 1

You can reproduce this with Jepsen d87abed0561621e8f94ce1d6659ce7488b3fcd31

lein run test --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --force-download --nemesis none --rebalance-interval 10h --sequencing server --upsert-schema --time-limit 600 --concurrency 1n --workload delete --retry-db-setup --test-count 20
@mkcp
Copy link

@mkcp mkcp commented Sep 5, 2018

We've been running additional bank tests on healthy clusters (no nemesis-induced network partitions) against v1.0.7, v1.0.8-rc1, and v1.0.8 and have not had any failing test cases in these versions. We still get incorrect balances, resulting in failing tests, when we partition the network. Progress! 🎉

manishrjain added a commit that referenced this issue Nov 20, 2018
Got it. Fixed it. Squashed it. Done with it. Tamed it. Time to bask in the glory of victory!

Fixed Jepsen bank test violation during a network partition. The violation was happening because:
1. Current Zero leader receives a commit request and assigns a commit timestamp.
2. Gets partitioned out, so the proposal gets blocked.
3. Another Zero becomes the leader and renews the txn timestamp lease, starting at a much higher number (previous lease + lease bandwidth of 10K).
4. The new leader services new txn requests, which advances all Alphas to a much higher MaxApplied timestamp. 
5. The previous leader, who is now the follower, retries the old commit proposal and succeeds. This causes 2 issues.
  a) A later txn read doesn't see this commit, and advances.
  b) Alpha rejects a write on a posting list key, which already has a new commit at higher ts.
Both of the scenarios caused bank txn test violation.

Open Census and Dgraph debug disect was instrumental in determining the cause of this violation. The tell-tale sign was noticing a /Commit timeout of one of the penultimate commits, to the violating commit.

Fixes:

0. Use OpenCensus as a torch to light every path that a txn took, to determine the cause.
1. Do not allow a context deadline when proposing a txn commit.
2. Do not allow Zero follower to propagate proposals to Zero leader.

Tested this PR to fix issue #2391 . Tested with partition-ring and clock-skew nemesis in a 10-node cluster. More testing is needed around predicate moves.

Changelog:

* Trial: Do not purge anything (needs to be refactored and reintroduced before a release).
* More debugging, more tracing for Jepsen.
* Opencensus in Zero.
* One fix for Jepsen test, which ensures that a context deadline cannot just cancel a txn proposal. Need some refactoring of how Zero gets membership updates, now that we need Zero to not forward proposals to the leader.
* Update Raft lib, so we have access to the feature to disallow Raft proposal forwarding.
* Update raftpb proto package as well
* Dirty changes to ensure that Zero followers don't forward proposals to Zero leader.
* Various Opencensus integration changes.
@manishrjain
Copy link
Member

@manishrjain manishrjain commented Nov 22, 2018

@danielmai tested this with:

for i in $(seq 1 10)
do
    lein run test \
         --local-binary /gobin/dgraph \
         --force-download \
         --nemesis fix-alpha,kill-alpha,kill-zero,partition-halves,partition-ring,skew-clock \
         --skew big \
         --workload bank \
         --rebalance-interval 10h \
         --upsert-schema \
         --time-limit 3600 \
         --concurrency 30 \
         --replicas 3 \
         --test-count 1 \
         --dgraph-jaeger-collector http://jaeger:14268 \
         --tracing http://jaeger:14268/api/traces
    sleep 60
done

screenshot from 2018-11-21 10-55-24

All passed! Finally, squashed this bug! Closing this issue.

@manishrjain
Copy link
Member

@manishrjain manishrjain commented Nov 22, 2018

This fix would be part of v1.0.11 release.

dna2github added a commit to dna2fork/dgraph that referenced this issue Jul 19, 2019
Got it. Fixed it. Squashed it. Done with it. Tamed it. Time to bask in the glory of victory!

Fixed Jepsen bank test violation during a network partition. The violation was happening because:
1. Current Zero leader receives a commit request and assigns a commit timestamp.
2. Gets partitioned out, so the proposal gets blocked.
3. Another Zero becomes the leader and renews the txn timestamp lease, starting at a much higher number (previous lease + lease bandwidth of 10K).
4. The new leader services new txn requests, which advances all Alphas to a much higher MaxApplied timestamp. 
5. The previous leader, who is now the follower, retries the old commit proposal and succeeds. This causes 2 issues.
  a) A later txn read doesn't see this commit, and advances.
  b) Alpha rejects a write on a posting list key, which already has a new commit at higher ts.
Both of the scenarios caused bank txn test violation.

Open Census and Dgraph debug disect was instrumental in determining the cause of this violation. The tell-tale sign was noticing a /Commit timeout of one of the penultimate commits, to the violating commit.

Fixes:

0. Use OpenCensus as a torch to light every path that a txn took, to determine the cause.
1. Do not allow a context deadline when proposing a txn commit.
2. Do not allow Zero follower to propagate proposals to Zero leader.

Tested this PR to fix issue dgraph-io#2391 . Tested with partition-ring and clock-skew nemesis in a 10-node cluster. More testing is needed around predicate moves.

Changelog:

* Trial: Do not purge anything (needs to be refactored and reintroduced before a release).
* More debugging, more tracing for Jepsen.
* Opencensus in Zero.
* One fix for Jepsen test, which ensures that a context deadline cannot just cancel a txn proposal. Need some refactoring of how Zero gets membership updates, now that we need Zero to not forward proposals to the leader.
* Update Raft lib, so we have access to the feature to disallow Raft proposal forwarding.
* Update raftpb proto package as well
* Dirty changes to ensure that Zero followers don't forward proposals to Zero leader.
* Various Opencensus integration changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.