Indefinite period of transaction conflicts after network partitions #2159

Closed
aphyr opened this issue Feb 22, 2018 · 2 comments

aphyr commented Feb 22, 2018

After a period of network partitions, Dgraph 1.0.3-dev (5563bd2) can wind up stuck in a mode where every transaction that attempts to modify a key conflicts immediately (i.e. on mutate, not on commit). Using the schema

value: [int] .

... we build up a set of integers by performing mutations associating each integer with a fixed UID; e.g. to insert the number 5, we execute a transaction with a single mutation:

{uid: "0x01"
 value: 5}

To read this set, we query for all values associated with that UID using { q(func: uid($u)) { uid, value } }. As we saw in #2152, this appears to show a stale state of the DB: all values up to some number are present, then every subsequent acknowledged value is missing. To distinguish between stale reads and lost updates, we follow that read, in the same transaction, with a sequence of inserts or deletes to the exact triples which we believe were successfully inserted--if we read the 5, we insert {uid: "0x01", value: 5}, and if we failed to read 5, we delete {uid: "0x01", value: 5} instead. These update transactions fail immediately with a conflict.
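
The follow-up step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Jepsen test's actual code: `forcing_mutations`, its parameters, and the `("set" | "delete", triple)` tuples are hypothetical names chosen for this sketch, and no Dgraph client calls are made.

```python
def forcing_mutations(observed, acknowledged, uid="0x01"):
    """Build the writes that follow a read in the same transaction.

    observed:     set of values the read returned for `uid`
    acknowledged: set of values whose inserts were acknowledged earlier
    Both names are assumptions made for this sketch.
    """
    ops = []
    for value in sorted(acknowledged):
        triple = {"uid": uid, "value": value}
        if value in observed:
            # We read the value, so re-insert the same triple.
            ops.append(("set", triple))
        else:
            # We failed to read the value, so delete the triple instead.
            ops.append(("delete", triple))
    return ops
```

Either way, the transaction now writes to the triples it read (or should have read), so any concurrent committed change to them should surface as a conflict rather than pass silently.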

INFO [2018-02-22 10:45:23,629] jepsen worker 0 - jepsen.util 140	:invoke	:read	nil
INFO [2018-02-22 10:45:23,631] jepsen worker 0 - jepsen.dgraph.set Forcing conflict by deleting 0
INFO [2018-02-22 10:45:23,678] jepsen worker 0 - jepsen.util 140	:fail	:read	nil	:conflict
INFO [2018-02-22 10:45:23,849] jepsen worker 9 - jepsen.util 169	:invoke	:read	nil
INFO [2018-02-22 10:45:23,851] jepsen worker 9 - jepsen.dgraph.set Forcing conflict by deleting 0
INFO [2018-02-22 10:45:23,875] jepsen worker 9 - jepsen.util 169	:fail	:read	nil	:conflict
INFO [2018-02-22 10:45:25,000] jepsen worker 6 - jepsen.util 106	:invoke	:read	nil
INFO [2018-02-22 10:45:25,002] jepsen worker 6 - jepsen.dgraph.set Forcing conflict by deleting 0
INFO [2018-02-22 10:45:25,048] jepsen worker 6 - jepsen.util 106	:fail	:read	nil	:conflict
INFO [2018-02-22 10:45:30,837] jepsen worker 7 - jepsen.util 157	:invoke	:read	nil
INFO [2018-02-22 10:45:30,839] jepsen worker 7 - jepsen.dgraph.set Forcing conflict by deleting 0
INFO [2018-02-22 10:45:30,885] jepsen worker 7 - jepsen.util 157	:fail	:read	nil	:conflict
INFO [2018-02-22 10:45:38,448] jepsen worker 3 - jepsen.util 153	:invoke	:read	nil
INFO [2018-02-22 10:45:38,450] jepsen worker 3 - jepsen.dgraph.set Forcing conflict by deleting 0
INFO [2018-02-22 10:45:38,481] jepsen worker 3 - jepsen.util 153	:fail	:read	nil	:conflict

This state appears to persist indefinitely: in an hour with no network disruption and no other transactions, every update failed in this way.

In an optimistic concurrency control system, we would expect these updates to fail if another transaction modified and committed that key some time after our update transaction began and before it completed. If transaction start times were allocated sequentially, the conflict-failure of n sequential updates to the same key implies the existence of n ongoing update transactions affecting that key, but in this test, we have no such evidence. There are any number of timed-out transactions that might be applied just in time to cause these failures, but eventually we should exhaust those.

Alternatively, these update transactions could be obtaining starting timestamps from some point far in the past, such that they conflict with an update transaction that completed long ago.
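
To make the two hypotheses concrete, here is a toy model of timestamp-based conflict detection. It is only a sketch under stated assumptions: the `Store` class, its single logical clock, and the rule "conflict if the key was committed after the transaction's start timestamp" are simplifications invented for illustration, not Dgraph's actual implementation.

```python
class Store:
    """Toy optimistic-concurrency store (hypothetical, for illustration only)."""

    def __init__(self):
        self.clock = 0          # single logical timestamp source
        self.last_commit = {}   # key -> commit timestamp of the latest write

    def next_ts(self):
        self.clock += 1
        return self.clock

    def commit(self, key, start_ts):
        """Return a commit timestamp, or None if the write conflicts."""
        # Conflict: someone committed this key after our transaction started.
        if self.last_commit.get(key, 0) > start_ts:
            return None
        commit_ts = self.next_ts()
        self.last_commit[key] = commit_ts
        return commit_ts


def run_updates(store, key, n, start_ts_source):
    """Run n sequential single-key updates; True marks a successful commit."""
    return [store.commit(key, start_ts_source()) is not None for _ in range(n)]


store = Store()
# Fresh start timestamps: sequential updates never conflict.
fresh = run_updates(store, "0x01", 3, store.next_ts)
# Start timestamps stuck in the past: every update conflicts, indefinitely.
stale = run_updates(store, "0x01", 3, lambda: 1)
```

In this model, sequential updates with fresh start timestamps can only fail if some other transaction keeps committing in between, whereas a start timestamp that stops advancing makes every later update conflict with a write that completed long ago.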

To reproduce this behavior, try Jepsen d8bb86a5219d17abe5a4125581350571c5ffe209, and run

lein run test -f --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz -w uid-set --time-limit 120 --nemesis partition-random-halves --concurrency 2n --test-count 10

manishrjain added the kind/bug label Feb 22, 2018
janardhan1993 self-assigned this Mar 7, 2018
manishrjain added the kind/bug label Mar 21, 2018

janardhan1993 commented

This issue might be due to readTs not moving forward on leader changes; it should be fixed in #2261.

manishrjain added this to the Sprint-000 milestone Mar 28, 2018

janardhan1993 commented

Fixed in master
