SI violation with node crashes or predicate moves in bank tests #2321

Open
aphyr opened this Issue Apr 10, 2018 · 18 comments


aphyr commented Apr 10, 2018

In healthy clusters, with @upsert schema directives on key and type attributes (but not on values), and server-side ordering set for all client transactions, version

Dgraph version   : v1.0.4
Commit SHA-1     : 6fb69e2
Commit timestamp : 2018-04-09 21:26:31 +0530
Branch           : jan/node_lockup

can exhibit a violation of snapshot isolation, allowing account values to drift higher or lower over time. This may be related to #2143 (the previous bank test failure without any nemesis, now closed), but I suspect this issue might be different from #2290, where documents are lost or have nil values instead of their correct data: that would cause an undercount, and in bank tests I can also observe overcounting. For instance, in 20180409T163239.000-0500.zip:

9	:invoke	:transfer	{:from 1, :to 3, :amount 2}
8	:ok	:read	{0 43, 1 4, 2 3, 3 3, 4 10, 5 7, 6 27, 7 3}
8	:invoke	:transfer	{:from 2, :to 6, :amount 4}
6	:fail	:transfer	{:from 7, :to 4, :amount 1}	:conflict
6	:invoke	:read	nil
3	:fail	:transfer	{:from 5, :to 7, :amount 1}	:conflict
3	:invoke	:read	nil
4	:ok	:read	{0 43, 1 4, 2 3, 3 3, 4 10, 5 7, 6 27, 7 3}
4	:invoke	:read	nil
2	:ok	:read	{0 43, 1 4, 2 5, 3 3, 4 10, 5 7, 6 25, 7 3}
6	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
2	:invoke	:read	nil
6	:invoke	:read	nil
6	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
6	:invoke	:read	nil
4	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
4	:invoke	:read	nil
3	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
3	:invoke	:read	nil
8	:fail	:transfer	{:from 2, :to 6, :amount 4}	:insufficient-funds
8	:invoke	:transfer	{:from 4, :to 7, :amount 5}
0	:ok	:transfer	{:from 1, :to 6, :amount 2}
0	:invoke	:read	nil
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:invoke	:read	nil
5	:fail	:transfer	{:from 1, :to 7, :amount 3}	:conflict
5	:invoke	:read	nil
1	:fail	:transfer	{:from 6, :to 2, :amount 2}	:conflict
1	:invoke	:transfer	{:from 6, :to 2, :amount 2}
0	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
3	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
0	:invoke	:transfer	{:from 2, :to 0, :amount 2}
3	:invoke	:transfer	{:from 7, :to 1, :amount 4}
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
4	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:read	nil
4	:invoke	:transfer	{:from 6, :to 5, :amount 3}
2	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
2	:invoke	:read	nil
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:read	nil
6	:invoke	:read	nil
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:transfer	{:from 1, :to 0, :amount 5}
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:invoke	:transfer	{:from 1, :to 3, :amount 5}
9	:ok	:transfer	{:from 1, :to 3, :amount 2}
9	:invoke	:transfer	{:from 0, :to 2, :amount 2}
0	:ok	:transfer	{:from 2, :to 0, :amount 2}
0	:invoke	:transfer	{:from 3, :to 5, :amount 5}
5	:fail	:transfer	{:from 1, :to 0, :amount 5}	:insufficient-funds
5	:invoke	:read	nil
2	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
2	:invoke	:transfer	{:from 0, :to 2, :amount 2}
6	:fail	:transfer	{:from 1, :to 3, :amount 5}	:insufficient-funds
6	:invoke	:transfer	{:from 1, :to 2, :amount 5}
5	:ok	:read	{0 45, 1 2, 2 1, 3 6, 4 10, 5 7, 6 29, 7 2}

Here, the final read has a total of 102: 2 has been transferred from account 2 to account 0, but account 3 has also gained 2 out of nowhere. The only recent increment of account 3 comes from process 9's transfer of 2 from 1 to 3, but it doesn't appear to have applied atomically?

These don't appear to just be read-only anomalies; they can be promoted via writes back into the database state--the value remains 102 for the remainder of the test.
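For context, the invariant the bank workload enforces is simple: under snapshot isolation, every read of all accounts must sum to the initial total. A minimal Python sketch of that check (illustration only; the actual Jepsen checker is written in Clojure):

```python
# Sketch of the bank-test invariant: under snapshot isolation, every
# read of all accounts should sum to the initial total (100 here).
# Illustration only -- the real Jepsen checker is Clojure, not this.

EXPECTED_TOTAL = 100

def check_read(balances):
    """balances: dict of account key -> balance from one :read op."""
    total = sum(balances.values())
    if total != EXPECTED_TOTAL:
        return {"type": "wrong-total", "total": total}
    return None

# The final read in the history above sums to 102, not 100:
final_read = {0: 45, 1: 2, 2: 1, 3: 6, 4: 10, 5: 7, 6: 29, 7: 2}
assert check_read(final_read) == {"type": "wrong-total", "total": 102}
```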

You can reproduce this with Jepsen 0ef6e711dfb07aad4afc84f7f9c3348961afa9d7 by running

lein run test --package-url https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz --time-limit 300 --concurrency 2n --nemesis kill-alpha,fix-alpha,kill-zero --test-count 20 --workload bank --upsert-schema
aphyr commented Apr 10, 2018

I've collected a bit more evidence here. These plots show the observed sum of accounts over time, and colors represent the different nodes being queried. In a snapshot isolated system, the numbers should always be n (100). In some runs, the value periodically jumps to n+d, then returns to n for a time, then goes back to n+d, and so on--this feels like it could be a read skew anomaly that wasn't promoted via a write. In this test, the anomalous total was the same each time, and visible across multiple nodes. I'm not sure what to make of this.

bank

Another plot looks like this:

bank

Here, we see a period where accounts drift lower, then they start drifting higher again. The semi-contiguous, fuzzy nature of the distribution feels consistent with read skews being propagated back into write transactions, causing the value to drift over time. I'm not sure what to make of the steady trends down and up, though--this doesn't exactly look like a random walk to me. I wonder if perhaps a single account which has a chance of going missing, or appearing as an extra account, in some reads might cause a linear drift in the mean value over time?
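One way to make that speculation concrete: if a transfer computes new balances from a stale or skewed snapshot and writes them back, it can clobber a concurrent update and create or destroy money. A toy Python model of such a lost update (purely illustrative; this is not a claim about Dgraph internals):

```python
# Toy model: a write based on a stale snapshot clobbers a concurrent
# update, changing the total. Purely illustrative, not Dgraph internals.
balances = {"A": 10, "B": 0, "C": 0}

snap1 = balances["A"]          # T1 snapshots A = 10
snap2 = balances["A"]          # T2 snapshots A = 10 (same stale view)

balances["A"] = snap1 - 3      # T1 moves 3 from A to B
balances["B"] += 3

balances["A"] = snap2 - 2      # T2 moves 2 from A to C, clobbering T1's debit
balances["C"] += 2

# T1's debit of A was lost: the system now holds 13 units, not 10.
assert sum(balances.values()) == 13
```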


aphyr commented Apr 10, 2018

I've managed to narrow this down to crashes with just alpha--it doesn't require a zero crash to reproduce.


aphyr commented Apr 11, 2018

OK, we can also reproduce this problem with no crashes and just network partitions, and it turns out there's a weird thing occurring I didn't think to check for--some reads include objects with no key! Take a look at this: 20180410T175145.000-0500.zip

      :value {nil 3, 0 1, 1 15, 2 18, 3 5, 4 9, 5 17, 6 23, 7 12},

We have a value for all 8 keys here, plus one we didn't expect: a blank key with the value 3. The 8 normal keys sum to 100, but the blank key raises the total to 103. This looks like it might be related to the blank records we got for queries in #2290?

I think this might explain the scenario where the total jumps to a fixed value for some time, then returns to normal--that could be a window in which the blank-key record is visible; since nothing uses it for transfers, it inflates reads without interfering with the underlying balances.
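To make the blank-key effect concrete, here's a small Python sketch (nil rendered as None; illustrative only, not the actual test code) showing how the phantom record inflates the observed total while the real accounts stay consistent:

```python
# Sketch: a phantom blank-key record inflates the observed total.
# Keys 0-7 are the accounts the test created; None is the blank key.
EXPECTED_KEYS = set(range(8))

read = {None: 3, 0: 1, 1: 15, 2: 18, 3: 5, 4: 9, 5: 17, 6: 23, 7: 12}

unexpected = set(read) - EXPECTED_KEYS
real_total = sum(v for k, v in read.items() if k in EXPECTED_KEYS)

assert unexpected == {None}        # the blank key
assert real_total == 100           # the 8 real accounts still balance
assert sum(read.values()) == 103   # the phantom raises the total by 3
```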


aphyr commented Apr 11, 2018

OK, cool. Dgraph will actually do this in healthy clusters, you just have to wait long enough. Here's a history with no crashes or partitions at all, where the sum of accounts is fine, fine, fine, then fluctuates wildly, dropping to less than 10% of its expected value at one point.

bank

20180410T223952.000-0500.zip

You can reproduce this with Jepsen 30c37d64f92ad2864e27ed6b818bf33e633a94be:

lein run test --package-url https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz --time-limit 600 --concurrency 2n --nemesis none --test-count 20 --workload bank --upsert-schema

aphyr commented Apr 11, 2018

Another run--here, we're OK until about 461 seconds in, when latency drops dramatically and we start getting a slew of errors.

bank 1

The band of yellow infos in the middle here happens because the transfer code crashes when it reads illegal records with missing values.

latency-raw 28

At that initial transition, we read a bunch of records like:

     {:type :ok,
      :f :read,
      :process 3,
      :time 460996049202,
      :value {0 nil, 1 nil, 2 4, 3 nil, 4 nil, 5 nil, 6 nil},

... for a query for all records of type "account". We never create accounts without a value (or a key, for that matter), so these are clearly illegal.
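The transfer crash on these records is mechanical: any comparison or arithmetic against a missing balance throws before the transaction can commit. A minimal Python sketch (None standing in for nil; hypothetical transfer logic, not the actual Clojure client):

```python
# Sketch: why reading a nil balance crashes transfer logic. Comparing
# or subtracting None raises TypeError, so the op ends as an :info.
# Hypothetical stand-in for the actual (Clojure) transfer code.
read = {0: None, 1: None, 2: 4, 3: None, 4: None, 5: None, 6: None}

def transfer(balances, frm, to, amount):
    if balances[frm] < amount:     # TypeError when balances[frm] is None
        raise ValueError("insufficient funds")
    balances[frm] -= amount
    balances[to] += amount

crashed = False
try:
    transfer(dict(read), 0, 2, 1)
except TypeError:
    crashed = True
assert crashed
```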

Around the same time we start seeing these nil requests, we get a handful of errors like

io.grpc.StatusRuntimeException: UNKNOWN: : rpc error: code = Unknown desc = Temporary error, attr: "amount" groupId: 2 Request sent to wrong server

and

io.grpc.StatusRuntimeException: UNKNOWN: : Tablet isn't being served by this instance.

And about thirty Predicate is being moved, please retry later errors. That makes me suspect that maybe predicate/tablet migrations are involved here? And--it's not a transient error, either. Those corrupt records show up for the rest of the test, even though the tablet migration errors only occur briefly.


aphyr commented Apr 11, 2018

@janardhan1993 suggested that 6fb69e2 might fix an issue related to tablet migration, but unfortunately, it looks like that build also fails in similar ways. In
20180411T150012.000-0500.zip, the value is stable for 496 seconds, then fluctuates down, comes back up, and enters some sort of stable superposition where reads on all nodes are able to observe either 100 or 103:

bank 1

It looks like that initial divergence occurs at around 13:08:28, which is a few seconds after alpha initiates a predicate move:

2018/04/11 13:08:20 tablet.go:174: Groups sorted by size: [{gid:2 size:6289} {gid:1 size:11491}]
2018/04/11 13:08:20 tablet.go:179: size_diff 5202
2018/04/11 13:08:20 tablet.go:69: Going to move predicate: [dummy], size: [17 B] from group 1 to 2
2018/04/11 13:08:20 tablet.go:104: Predicate move done for: [dummy] from group 1 to 2

This test exhibits not only drifts in the total and long-term superposition, but nil keys (which are indexed in the schema) and nil values (which are not indexed). Out of 12602 total reads, 7238 were invalid in some way--7117 of those as incorrect totals, 119 unexpected nil keys, and 2 nil values.

{:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :workload
 {:SI
  {:valid? false,
   :read-count 12602,
   :error-count 7238,
   :first-error
   {:type :wrong-total,
    :total 103,
    :op
    {:type :ok,
     :f :read,
     :process 1,
     :time 496531879803,
     :value {0 4, 2 12, 3 3, 5 11, 6 52, 7 21},
     :index 21317}},
   :errors
   {:wrong-total
    {:count 7117,
     :first
     {:type :wrong-total,
      :total 103,
      :op
      {:type :ok,
       :f :read,
       :process 1,
       :time 496531879803,
       :value {0 4, 2 12, 3 3, 5 11, 6 52, 7 21},
       :index 21317}},
     :worst
     {:type :wrong-total,
      :total 85,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 517977188927,
       :value {0 3, 7 20, 1 23, 4 6, 6 22, 3 3, 2 2, 5 6},
       :index 22219}},
     :last
     {:type :wrong-total,
      :total 103,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 1199926900013,
       :value {0 11, 1 8, 2 9, 3 4, 4 35, 6 26, 7 10},
       :index 50339}},
     :lowest
     {:type :wrong-total,
      :total 85,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 517977188927,
       :value {0 3, 7 20, 1 23, 4 6, 6 22, 3 3, 2 2, 5 6},
       :index 22219}},
     :highest
     {:type :wrong-total,
      :total 107,
      :op
      {:type :ok,
       :f :read,
       :process 1,
       :time 577645384554,
       :value {0 7, 7 6, 1 4, 4 24, 6 2, 3 4, 2 28, 5 32},
       :index 24791}}},
    :unexpected-key
    {:count 119,
     :first
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 500866299978,
       :value {nil 1, 0 10, 7 17, 1 10, 4 1, 6 46, 3 4, 2 9, 5 5},
       :index 21511}},
     :worst
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 500866299978,
       :value {nil 1, 0 10, 7 17, 1 10, 4 1, 6 46, 3 4, 2 9, 5 5},
       :index 21511}},
     :last
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 8,
       :time 1194357287908,
       :value {nil 4, 0 9, 7 13, 1 7, 4 23, 6 18, 3 8, 2 15, 5 10},
       :index 50153}}},
    :nil-balance
    {:count 2,
     :first
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 793602903430,
       :value {0 32, 7 15, 1 16, 4 3, 6 9, 3 nil, 2 12, 5 16},
       :index 33809}},
     :worst
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 793602903430,
       :value {0 32, 7 15, 1 16, 4 3, 6 9, 3 nil, 2 12, 5 16},
       :index 33809}},
     :last
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 1056476882993,
       :value {0 10, 7 14, 1 14, 4 4, 6 52, 3 nil, 2 7, 5 2},
       :index 44476}}}}},
  :plot {:valid? true},
  :valid? false},
 :valid? false}

aphyr commented Apr 12, 2018

This problem looks like it requires high throughput or contention--1 req/sec/client isn't enough to reproduce it, at least in hour-long tests, even though predicates do move during that time. That suggests it might be a race between a client request and a tablet being moved, as opposed to the migration process introducing errors on its own.


aphyr commented Apr 16, 2018

I can also confirm these issues with the latest nightly: 2018-04-06 13:20:06 +1000 8333f72.


aphyr commented Apr 16, 2018

Nothing particularly new here, but this plot suggests the characteristic behaviors we've identified in account sums (sloping, jumping between two values, and fuzzily bouncing all over the place) are separable phenomena, although each is associated with a tablet move. 20180416T111943.000-0500.zip

bank


aphyr commented Apr 17, 2018

You can now reproduce these problems significantly faster in Jepsen eb796cfcc204c592545965968bd28ad1e6b2eff0 by using the move-tablet nemesis, which shuffles tablets around every 15 seconds or so.

lein run test --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --time-limit 60 --concurrency 1n --nemesis move-tablet --workload bank --upsert-schema

janardhan1993 (Contributor) commented Apr 17, 2018

Some of the issues are fixed in #2339


aphyr commented Apr 17, 2018

This is significantly improved in 95f3382, but still allows nil keys and incorrect totals.


manishrjain (Member) commented May 16, 2018

This might also be fixed by #2392. Testing it.


manishrjain (Member) commented May 17, 2018

I have to run the test for half an hour before this happens, but I see a nil key being added. Not sure if this is due to predicate move, or something else.

 {:valid? false,
   :read-count 21882,
   :error-count 1427,
   :first-error
   {:type :unexpected-key,
    :unexpected (nil),
    :op
    {:type :ok,
     :f :read,
     :process 52,
     :time 1153593764746,
     :value {nil 2, 0 23, 1 18, 2 17, 3 2, 5 24, 6 6, 7 10},
     :index 58951}},
aphyr commented May 17, 2018

Yeah, it looks like we're still seeing violations in the binary you sent over on 2018-05-16 (it doesn't log any version info so I'm not exactly sure what code was in it). They're less frequent now! It took a couple hours before I hit this case: 20180516T194304.000-0500.zip

In this case, we never observed nil keys, but it looks like we abruptly lost about half the keys in the database--keys 2, 3, 4, and 7 vanished.

{:type :invoke, :f :read, :process 7, :time 80842565886, :index 2017}
{:type :ok, :f :read, :process 5, :time 80997241014, :value {0 31, 1 15, 2 15, 3 7, 4 16, 5 2, 6 8, 7 6}, :index 2018}
{:type :invoke, :f :transfer, :value {:from 2, :to 1, :amount 3}, :process 5, :time 80997495741, :index 2019}
{:type :ok, :f :read, :process 7, :time 81175273547, :value {0 31, 1 15, 5 2, 6 8}, :index 2020}

It looks like these errors are persistent--they affect all nodes and are semi-stable over time.

bank 3


manishrjain (Member) commented May 18, 2018

Testing this right now, let's see.


aphyr commented May 22, 2018

Testing the build you sent on Friday, we're still seeing divergence every ~5000 seconds with predicate moves. In 20180522T110600.000-0500.zip, the value diverged in what sort of looks like a split-brain scenario:

bank 4

You can reproduce this on Jepsen 789103bfd25f8635ed3410a38237003b0ae45da1 with

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing server --upsert-schema --time-limit 300 --concurrency 1n --nemesis-interval 5 --workload bank --retry-db-setup --test-count 10


@aphyr aphyr changed the title from SI violation with node crashes in bank tests to SI violation with node crashes or predicate moves in bank tests May 22, 2018

@manishrjain manishrjain removed the verify-fix label May 23, 2018

manishrjain (Member) commented May 23, 2018

I'm trying out bank tests with predicate moves disabled. Want to determine if this problem is only due to moves, or it happens without the moves as well.

