Possible loss of all inserts after network partition #2152

Closed
aphyr opened this Issue Feb 21, 2018 · 10 comments

aphyr commented Feb 21, 2018

On the current Dgraph nightly (2018/02/21), it appears that network partitions may cause the loss (or indefinitely-deferred visibility) of all acknowledged inserts after some critical event. This may be related to #2159: mutate transactions may conflict indefinitely after partitions.

We use the following schema:

type: string @index(exact) .
value: int @index(int) .

And perform inserts of single records like {type: "element", value: 123}, where values are sequential integers, during simple majority/minority network partitions with a 2-seconds-on/2-seconds-off schedule. We then heal the network and allow ten minutes for the cluster to recover, followed by a query for all elements by each client:

{ q(func: eq(type, "element")) { uid, value } }
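
For concreteness, here is a minimal pydgraph sketch of this insert-then-read pattern. It is an illustration only, not the actual Jepsen client; the endpoint, client setup, and use of Python are assumptions.

import json
import pydgraph

# Connect to a single alpha; the address is an assumed example.
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

# Apply the schema described above.
client.alter(pydgraph.Operation(
    schema='type: string @index(exact) .\nvalue: int @index(int) .'))

# Insert one element; the test inserts sequential integers, one per transaction.
txn = client.txn()
try:
    txn.mutate(set_obj={"type": "element", "value": 123})
    txn.commit()
finally:
    txn.discard()

# Final read of all acknowledged elements.
resp = client.txn().query('{ q(func: eq(type, "element")) { uid, value } }')
print(json.loads(resp.json))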

These reads can fail to return all acknowledged documents after some critical time:

 {:valid? false,
  :lost
  "#{140 149 151..155 169..174 176..178 183 186 189 191 196 200..201 203 206 208..210 212 214..227 229..237 242 244 251 254 257..259 261..262 265 267 269..271 277..278 461..466 469 472..473 475 477..483 488..489 491 494..495 497..498 502..505 507 510 512..513 516 518 521..522 525 528 532 537..538 541 544 547 550 675 685..686 688..690 2263 2266..2269 2272..2273 2275}",
  :recovered "#{}",
  :ok
  "#{0 6 10 12 18 23 25 29 32 34 37..39 42 45 48 54 58 62..63 67 71 73 75 77 81 84 86 90 106 109..115 129 137 279 284}",
  :recovered-frac 0,
  :unexpected-frac 0,
  :unexpected "#{}",
  :lost-frac 127/2373,
  :ok-frac 41/2373},

In this case, 41 acknowledged writes from 0 to 284 were found in the final read, but 127 acknowledged writes from 140 to 2275 were not found. The inserted records were assigned UIDs, so I don't think it's a return-code issue. I think this is, strictly speaking, legal under SI; snapshot isolation allows arbitrarily stale reads, and I thiiiink that should apply to index reads as well.

This behavior is accompanied by a 100% CPU jump on one node in the cluster for several minutes, sometimes beginning when the network heals, sometimes not until a read. Possibly associated with a tablet handoff? Profile attached: pprof.dgraph.samples.cpu.001.pb.gz

You can reproduce this behavior with Jepsen 2371d2ac8c0732dfa9368e3f54eba1a2c5cf3895, using the nightlies (or perhaps 1.0.3 or 1.0.4), by running

lein run test -w set -f --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --time-limit 300 --nemesis partition-random-halves --concurrency 2n

One UID, no indices

This behavior also manifests without indices. We use this schema:

value: [int] .

... and insert elements by associating them all with the same UID, e.g.

{"uid": "0x01",
 "value": 1}
{"uid": "0x01",
 "value": 2}

... and so on. We read all inserted values with { q(func: uid(0x01)) { uid, value } }. When partitions occur, the same problem can arise: all successfully inserted tuples after some point are missing from the final read. This is present on v1.0.3-dev (5563bd2), and can be reproduced with Jepsen ce4d1e839f53a1146cb81c3b3d4e247d1e490d7c by running

lein run test -f --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz -w uid-set --time-limit 120 --nemesis partition-random-halves --concurrency 2n
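
For reference, a comparable pydgraph sketch of this uid-set workload follows; again, this is an illustration with an assumed endpoint and client setup, not the test code itself.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed endpoint
client = pydgraph.DgraphClient(stub)

# Schema from the test: a list of ints on a single predicate.
client.alter(pydgraph.Operation(schema='value: [int] .'))

# Each insert attaches another value to the same node, uid 0x01.
def insert(value):
    txn = client.txn()
    try:
        txn.mutate(set_obj={"uid": "0x01", "value": value})
        txn.commit()
    finally:
        txn.discard()

for v in (1, 2, 3):
    insert(v)

# Read every value back from that single UID.
resp = client.txn().query('{ q(func: uid(0x01)) { uid, value } }')
print(json.loads(resp.json))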
Contributor

janardhan1993 commented Mar 22, 2018

@aphyr: I am getting the following error sometimes when running
lein run test --package-url https://transfer.sh/TjHBo/dgraph-linux-amd64.tar.gz -w set --time-limit 300 --concurrency 2n --nemesis partition-random-halves.
Any idea why it might be happening?

WARN [2018-03-22 11:45:32,797] clojure-agent-send-off-pool-31 - jepsen.checker Error while checking history:
java.lang.ClassCastException: clojure.lang.ArraySeq cannot be cast to clojure.lang.IPersistentSet
	at clojure.core$disj.invokeStatic(core.clj:1517) ~[clojure-1.9.0.jar:na]
	at clojure.set$intersection$fn__8665.invoke(set.clj:41) ~[clojure-1.9.0.jar:na]
	at clojure.lang.ArraySeq.reduce(ArraySeq.java:109) ~[clojure-1.9.0.jar:na]
	at clojure.core$reduce.invokeStatic(core.clj:6747) ~[clojure-1.9.0.jar:na]
	at clojure.set$intersection.invokeStatic(set.clj:38) ~[clojure-1.9.0.jar:na]
	at clojure.set$intersection.invoke(set.clj:33) ~[clojure-1.9.0.jar:na]
	at jepsen.checker$set$reify__3616.check(checker.clj:190) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na]
	at jepsen.checker$compose$reify__3583$fn__3585.invoke(checker.clj:87) [jepsen-0.1.9-SNAPSHOT.jar:na]
	at clojure.core$pmap$fn__8105$fn__8106.invoke(core.clj:6942) [clojure-1.9.0.jar:na]
	at clojure.core$binding_conveyor_fn$fn__5476.invoke(core.clj:2022) [clojure-1.9.0.jar:na]
	at clojure.lang.AFn.call(AFn.java:18) [clojure-1.9.0.jar:na]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_151]
INFO [2018-03-22 11:45:33,033] jepsen test runner - jepsen.core Analysis complete
INFO [2018-03-22 11:45:33,064] jepsen results - jepsen.store Wrote /jepsen/dgraph/store/dgraph 1.0.3 set partition-random-halves/20180322T112856.000Z/results.edn
INFO [2018-03-22 11:45:33,101] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :workload
 {:valid? :unknown,
  :error
  "java.lang.ClassCastException: clojure.lang.ArraySeq cannot be cast to clojure.lang.IPersistentSet\n at clojure.core$disj.invokeStatic (core.clj:1517)\n    clojure.set$intersection$fn__8665.invoke (set.clj:41)\n    clojure.lang.ArraySeq.reduce (ArraySeq.java:109)\n    clojure.core$reduce.invokeStatic (core.clj:6747)\n    clojure.set$intersection.invokeStatic (set.clj:38)\n    clojure.set$intersection.invoke (set.clj:33)\n    jepsen.checker$set$reify__3616.check (checker.clj:190)\n    jepsen.checker$check_safe.invokeStatic (checker.clj:71)\n    jepsen.checker$check_safe.invoke (checker.clj:64)\n    jepsen.checker$compose$reify__3583$fn__3585.invoke (checker.clj:87)\n    clojure.core$pmap$fn__8105$fn__8106.invoke (core.clj:6942)\n    clojure.core$binding_conveyor_fn$fn__5476.invoke (core.clj:2022)\n    clojure.lang.AFn.call (AFn.java:18)\n    java.util.concurrent.FutureTask.run (FutureTask.java:266)\n    java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)\n    java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)\n    java.lang.Thread.run (Thread.java:748)\n"},
 :valid? :unknown}

dgraph-bot added the bug label Mar 27, 2018

Contributor

janardhan1993 commented Mar 27, 2018

This issue might be due to readTs not moving forward on leader changes; it should be fixed in #2261.
Will confirm after the above error is resolved.

aphyr commented Mar 27, 2018

Hi @janardhan1993! Tracked down this bug too; it was another issue I introduced into this test while working on a different Dgraph test. Fixed now!

aphyr commented Mar 28, 2018

I can confirm that v1.0.4 (807976c) from 2018-03-22 14:55:24 +1100 still loses inserted records in this test; here's another example: 20180327T135250.000-0500.zip

Also ran into a new sort of issue where a node got stuck forever during this test: #2273.

aphyr commented Mar 28, 2018

It also looks like I can lock the system into some weird state where a key's timestamp is skewed far into the future, and it'll refuse to service reads for hours; you get a slow sequence of failures like:

INFO [2018-03-27 23:38:42,409] jepsen worker 9 - jepsen.util 3029	:info	:read	nil	indeterminate: UNKNOWN: : rpc error: code = Unknown desc = readTs: 13272 less than minTs: 30014 for key: "\x00\x00\x04type\x02\x02element"

The read timestamps do advance, but they only increment on reads, so I think it's gonna be quite a while before it catches up. It's also odd that this is happening with fresh clients each time; I figured those could start anywhere in the timeline...

Contributor

janardhan1993 commented Mar 28, 2018

@aphyr I will get those changes in; I haven't merged the fixes yet. Will update once done.

Contributor

janardhan1993 commented Mar 28, 2018

aphyr commented Mar 28, 2018

Sure thing. :)

aphyr commented Mar 28, 2018

Looks like we're still getting lockups à la #2273 with this build. :(

dgraph-bot added the bug label Mar 28, 2018

aphyr commented Mar 28, 2018

On the plus side, though, I'm not seeing lost inserts any more! That's good!

manishrjain added this to the Sprint-000 milestone Mar 28, 2018

manishrjain added the backlog label Mar 29, 2018

manishrjain removed the backlog label Mar 29, 2018
