SI violation with node crashes or predicate moves in bank tests #2321

Closed · aphyr opened this issue Apr 10, 2018 · 20 comments

aphyr commented Apr 10, 2018

In healthy clusters, with @upsert schema directives on key and type attributes (but not on values), and server-side ordering set for all client transactions, version

Dgraph version   : v1.0.4
Commit SHA-1     : 6fb69e2
Commit timestamp : 2018-04-09 21:26:31 +0530
Branch           : jan/node_lockup

can exhibit a violation of snapshot isolation, allowing account values to drift higher or lower over time. This may be related to #2143 (the previous bank test failure without any nemesis, now closed), but I suspect this issue might be different from #2290, where documents are lost or have nil values instead of their correct data, because that would only cause an undercount, whereas in bank tests I can also observe overcounting. For instance, in 20180409T163239.000-0500.zip:

9	:invoke	:transfer	{:from 1, :to 3, :amount 2}
8	:ok	:read	{0 43, 1 4, 2 3, 3 3, 4 10, 5 7, 6 27, 7 3}
8	:invoke	:transfer	{:from 2, :to 6, :amount 4}
6	:fail	:transfer	{:from 7, :to 4, :amount 1}	:conflict
6	:invoke	:read	nil
3	:fail	:transfer	{:from 5, :to 7, :amount 1}	:conflict
3	:invoke	:read	nil
4	:ok	:read	{0 43, 1 4, 2 3, 3 3, 4 10, 5 7, 6 27, 7 3}
4	:invoke	:read	nil
2	:ok	:read	{0 43, 1 4, 2 5, 3 3, 4 10, 5 7, 6 25, 7 3}
6	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
2	:invoke	:read	nil
6	:invoke	:read	nil
6	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
6	:invoke	:read	nil
4	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
4	:invoke	:read	nil
3	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
3	:invoke	:read	nil
8	:fail	:transfer	{:from 2, :to 6, :amount 4}	:insufficient-funds
8	:invoke	:transfer	{:from 4, :to 7, :amount 5}
0	:ok	:transfer	{:from 1, :to 6, :amount 2}
0	:invoke	:read	nil
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:invoke	:read	nil
5	:fail	:transfer	{:from 1, :to 7, :amount 3}	:conflict
5	:invoke	:read	nil
1	:fail	:transfer	{:from 6, :to 2, :amount 2}	:conflict
1	:invoke	:transfer	{:from 6, :to 2, :amount 2}
0	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
3	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
0	:invoke	:transfer	{:from 2, :to 0, :amount 2}
3	:invoke	:transfer	{:from 7, :to 1, :amount 4}
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
4	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:read	nil
4	:invoke	:transfer	{:from 6, :to 5, :amount 3}
2	:ok	:read	{0 43, 1 4, 2 3, 3 4, 4 10, 5 7, 6 27, 7 2}
2	:invoke	:read	nil
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:read	nil
6	:invoke	:read	nil
5	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
5	:invoke	:transfer	{:from 1, :to 0, :amount 5}
6	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
6	:invoke	:transfer	{:from 1, :to 3, :amount 5}
9	:ok	:transfer	{:from 1, :to 3, :amount 2}
9	:invoke	:transfer	{:from 0, :to 2, :amount 2}
0	:ok	:transfer	{:from 2, :to 0, :amount 2}
0	:invoke	:transfer	{:from 3, :to 5, :amount 5}
5	:fail	:transfer	{:from 1, :to 0, :amount 5}	:insufficient-funds
5	:invoke	:read	nil
2	:ok	:read	{0 43, 1 2, 2 3, 3 4, 4 10, 5 7, 6 29, 7 2}
2	:invoke	:transfer	{:from 0, :to 2, :amount 2}
6	:fail	:transfer	{:from 1, :to 3, :amount 5}	:insufficient-funds
6	:invoke	:transfer	{:from 1, :to 2, :amount 5}
5	:ok	:read	{0 45, 1 2, 2 1, 3 6, 4 10, 5 7, 6 29, 7 2}

Here, the final read has a total of 102: 2 has been transferred from account 2 to account 0, but account 3 has also gained 2 out of nowhere. The only recent increment of account 3 comes from process 9's transfer of 2 from account 1 to account 3, but that transfer doesn't appear to have been applied atomically?

These don't appear to just be read-only anomalies; they can be promoted via writes back into the database state--the value remains 102 for the remainder of the test.
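
For reference, here's a minimal sketch of the invariant being checked (in Go, purely illustrative--this is not the actual Clojure checker Jepsen runs): every successful read of all accounts must sum to the fixed initial total of 100, and the final read above sums to 102.

    package main

    import "fmt"

    // A read is a snapshot of account key -> balance, as in the :read ops above.
    type read map[int]int

    const expectedTotal = 100

    // checkRead flags any read whose balances do not sum to the expected total.
    func checkRead(r read) error {
        sum := 0
        for _, balance := range r {
            sum += balance
        }
        if sum != expectedTotal {
            return fmt.Errorf("wrong total: %d (expected %d)", sum, expectedTotal)
        }
        return nil
    }

    func main() {
        // The final read from the history above.
        final := read{0: 45, 1: 2, 2: 1, 3: 6, 4: 10, 5: 7, 6: 29, 7: 2}
        if err := checkRead(final); err != nil {
            fmt.Println(err) // wrong total: 102 (expected 100)
        }
    }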

You can reproduce this with Jepsen 0ef6e711dfb07aad4afc84f7f9c3348961afa9d7 by running

lein run test --package-url https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz --time-limit 300 --concurrency 2n --nemesis kill-alpha,fix-alpha,kill-zero --test-count 20 --workload bank --upsert-schema
aphyr commented Apr 10, 2018

I've collected a bit more evidence here. These plots show the observed sum of accounts over time, and colors represent the different nodes being queried. In a snapshot-isolated system, the total should always be n (here, 100). In some runs, the value periodically jumps to n+d, returns to n for a time, then goes back to n+d, and so on--this feels like it could be a read-skew anomaly that wasn't promoted via a write. In this test, the anomalous total was the same each time, and visible across multiple nodes. I'm not sure what to make of this.

[plot: observed sum of accounts over time, colored by node]

Another plot looks like this:

[plot: observed sum of accounts over time, colored by node]

Here, we see a period where accounts drift lower, then they start drifting higher again. The semi-contiguous, fuzzy nature of the distribution feels consistent with read skews being propagated back into write transactions, causing the value to drift over time. I'm not sure what to make of the steady trends down and up, though--this doesn't exactly look like a random walk to me. I wonder whether a single account that sometimes goes missing from reads, or shows up as an extra account, might cause a linear drift in the mean value over time.
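
As a rough illustration of that mechanism (a hypothetical sketch, not the test's actual transfer code): if a transfer reads a stale snapshot that missed a recent credit to one of the accounts and then writes back absolute balances computed from that stale read, the credit is silently lost, and repeated occurrences would walk the total up or down over time.

    package main

    import "fmt"

    func main() {
        // True committed state: two accounts, total 20.
        accounts := map[int]int{1: 10, 2: 10}

        // Another transaction credits account 2 with 3 and commits first.
        accounts[2] += 3 // true total is now 23

        // A concurrent transfer of 5 from account 2 to account 1 reads a stale
        // snapshot that missed that credit...
        staleFrom, staleTo := 10, 10

        // ...and writes back absolute balances computed from the stale values.
        accounts[2] = staleFrom - 5
        accounts[1] = staleTo + 5

        // The credit of 3 has been overwritten: the total drifts from 23 to 20
        // and stays there for every later read.
        fmt.Println(accounts[1] + accounts[2]) // 20
    }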

aphyr commented Apr 10, 2018

I've managed to narrow this down to crashes of just Alpha--it doesn't require a Zero crash to reproduce.

aphyr commented Apr 11, 2018

OK, we can also reproduce this problem with no crashes and just network partitions, and it turns out there's a weird thing occurring that I didn't think to check for--some reads include objects with no key! Take a look at this: 20180410T175145.000-0500.zip

      :value {nil 3, 0 1, 1 15, 2 18, 3 5, 4 9, 5 17, 6 23, 7 12},

We have a value for all 8 keys here, plus one we didn't expect: a blank key with the value 3. The 8 normal keys sum to 100, but the blank key raises the total to 103. This looks like it might be related to the blank records we got for queries in #2290?

I think this might explain the scenario where the total jumps to a fixed value for some time, then returns to normal--that might be a window during which the blank-key record is visible, and as long as nothing uses it for transfers, it won't permanently affect the underlying accounts.
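
For what it's worth, here's roughly the kind of check that catches these (a simplified Go sketch, not the test's actual Clojure code; the response shape and the "q" block name are just illustrative): any account object in the response that is missing its key (or its amount) gets flagged.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // account mirrors the predicates used in the test; pointer fields distinguish
    // "absent" from "zero".
    type account struct {
        Key    *int `json:"key"`
        Amount *int `json:"amount"`
    }

    func main() {
        // Mimics the anomaly above: the last object has an amount but no key.
        resp := []byte(`{"q":[{"key":0,"amount":1},{"key":1,"amount":15},{"amount":3}]}`)

        var data struct {
            Q []account `json:"q"`
        }
        if err := json.Unmarshal(resp, &data); err != nil {
            panic(err)
        }
        for i, a := range data.Q {
            switch {
            case a.Key == nil:
                fmt.Printf("record %d has no key\n", i)
            case a.Amount == nil:
                fmt.Printf("record %d (key %d) has no amount\n", i, *a.Key)
            }
        }
    }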

aphyr commented Apr 11, 2018

OK, cool. Dgraph will actually do this in healthy clusters; you just have to wait long enough. Here's a history with no crashes or partitions at all, where the sum of accounts is fine, fine, fine, then fluctuates wildly, dropping to less than 10% of its expected value at one point.

[plot: sum of accounts over time]

20180410T223952.000-0500.zip

You can reproduce this with Jepsen 30c37d64f92ad2864e27ed6b818bf33e633a94be:

lein run test --package-url https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz --time-limit 600 --concurrency 2n --nemesis none --test-count 20 --workload bank --upsert-schema
aphyr commented Apr 11, 2018

Another run: here, we're OK until about 461 seconds in, when latency drops dramatically and we start getting a slew of errors.

[plot: sum of accounts over time]

The band of yellow infos in the middle here happens because the transfer code crashes when it reads illegal records with missing values.

[plot: raw operation latency]

At that initial transition, we read a bunch of records like:

     {:type :ok,
      :f :read,
      :process 3,
      :time 460996049202,
      :value {0 nil, 1 nil, 2 4, 3 nil, 4 nil, 5 nil, 6 nil},

... for a query for all records of type "account". We never create accounts without a value (or a key, for that matter), so these are clearly illegal.

Around the same time we start seeing these nil reads, we get a handful of errors like

io.grpc.StatusRuntimeException: UNKNOWN: : rpc error: code = Unknown desc = Temporary error, attr: "amount" groupId: 2 Request sent to wrong server

and

io.grpc.StatusRuntimeException: UNKNOWN: : Tablet isn't being served by this instance.

And about thirty "Predicate is being moved, please retry later" errors. That makes me suspect that predicate/tablet migrations might be involved here. And it's not a transient error, either: those corrupt records show up for the rest of the test, even though the tablet migration errors only occur briefly.
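
A rough sketch of how a client might classify this family of errors as retryable, keyed off the message strings quoted above (purely illustrative; I don't know of a structured error code for these, so string matching is an assumption):

    package main

    import (
        "errors"
        "fmt"
        "strings"
    )

    // retryableTabletError reports whether an error looks like one of the
    // transient predicate/tablet-move errors quoted above.
    func retryableTabletError(err error) bool {
        if err == nil {
            return false
        }
        msg := err.Error()
        for _, s := range []string{
            "Predicate is being moved",
            "Tablet isn't being served by this instance",
            "Request sent to wrong server",
        } {
            if strings.Contains(msg, s) {
                return true
            }
        }
        return false
    }

    func main() {
        err := errors.New(`rpc error: code = Unknown desc = Temporary error, attr: "amount" groupId: 2 Request sent to wrong server`)
        fmt.Println(retryableTabletError(err)) // true
    }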

aphyr commented Apr 11, 2018

@janardhan1993 suggested that 6fb69e2 might fix an issue related to tablet migration, but unfortunately, it looks like that build also fails in similar ways. In 20180411T150012.000-0500.zip, the value is stable for 496 seconds, then fluctuates down, comes back up, and enters some sort of stable superposition where reads on all nodes are able to observe either 100 or 103:

[plot: sum of accounts over time]

It looks like that initial divergence occurs at around 13:08:28, which is a few seconds after alpha initiates a predicate move:

2018/04/11 13:08:20 tablet.go:174: Groups sorted by size: [{gid:2 size:6289} {gid:1 size:11491}]
2018/04/11 13:08:20 tablet.go:179: size_diff 5202
2018/04/11 13:08:20 tablet.go:69: Going to move predicate: [dummy], size: [17 B] from group 1 to 2
2018/04/11 13:08:20 tablet.go:104: Predicate move done for: [dummy] from group 1 to 2

This test exhibits not only drifts in the total and long-term superposition, but also nil keys (which are indexed in the schema) and nil values (which are not indexed). Out of 12602 total reads, 7238 were invalid in some way: 7117 had incorrect totals, 119 had unexpected nil keys, and 2 had nil values.

{:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :workload
 {:SI
  {:valid? false,
   :read-count 12602,
   :error-count 7238,
   :first-error
   {:type :wrong-total,
    :total 103,
    :op
    {:type :ok,
     :f :read,
     :process 1,
     :time 496531879803,
     :value {0 4, 2 12, 3 3, 5 11, 6 52, 7 21},
     :index 21317}},
   :errors
   {:wrong-total
    {:count 7117,
     :first
     {:type :wrong-total,
      :total 103,
      :op
      {:type :ok,
       :f :read,
       :process 1,
       :time 496531879803,
       :value {0 4, 2 12, 3 3, 5 11, 6 52, 7 21},
       :index 21317}},
     :worst
     {:type :wrong-total,
      :total 85,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 517977188927,
       :value {0 3, 7 20, 1 23, 4 6, 6 22, 3 3, 2 2, 5 6},
       :index 22219}},
     :last
     {:type :wrong-total,
      :total 103,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 1199926900013,
       :value {0 11, 1 8, 2 9, 3 4, 4 35, 6 26, 7 10},
       :index 50339}},
     :lowest
     {:type :wrong-total,
      :total 85,
      :op
      {:type :ok,
       :f :read,
       :process 4,
       :time 517977188927,
       :value {0 3, 7 20, 1 23, 4 6, 6 22, 3 3, 2 2, 5 6},
       :index 22219}},
     :highest
     {:type :wrong-total,
      :total 107,
      :op
      {:type :ok,
       :f :read,
       :process 1,
       :time 577645384554,
       :value {0 7, 7 6, 1 4, 4 24, 6 2, 3 4, 2 28, 5 32},
       :index 24791}}},
    :unexpected-key
    {:count 119,
     :first
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 500866299978,
       :value {nil 1, 0 10, 7 17, 1 10, 4 1, 6 46, 3 4, 2 9, 5 5},
       :index 21511}},
     :worst
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 500866299978,
       :value {nil 1, 0 10, 7 17, 1 10, 4 1, 6 46, 3 4, 2 9, 5 5},
       :index 21511}},
     :last
     {:type :unexpected-key,
      :unexpected (nil),
      :op
      {:type :ok,
       :f :read,
       :process 8,
       :time 1194357287908,
       :value {nil 4, 0 9, 7 13, 1 7, 4 23, 6 18, 3 8, 2 15, 5 10},
       :index 50153}}},
    :nil-balance
    {:count 2,
     :first
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 793602903430,
       :value {0 32, 7 15, 1 16, 4 3, 6 9, 3 nil, 2 12, 5 16},
       :index 33809}},
     :worst
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 793602903430,
       :value {0 32, 7 15, 1 16, 4 3, 6 9, 3 nil, 2 12, 5 16},
       :index 33809}},
     :last
     {:type :nil-balance,
      :nils {3 nil},
      :op
      {:type :ok,
       :f :read,
       :process 0,
       :time 1056476882993,
       :value {0 10, 7 14, 1 14, 4 4, 6 52, 3 nil, 2 7, 5 2},
       :index 44476}}}}},
  :plot {:valid? true},
  :valid? false},
 :valid? false}
aphyr commented Apr 12, 2018

This problem looks like it requires high throughput or contention--1 req/sec/client isn't enough to reproduce it, at least in hour-long tests, even though predicates do move during that time. That suggests it might be a race between a client request and a tablet being moved, as opposed to the migration process introducing errors on its own.

aphyr commented Apr 16, 2018

I can also confirm these issues with the latest nightly: 2018-04-06 13:20:06 +1000 8333f72.

aphyr commented Apr 16, 2018

Nothing particularly new here, but this plot suggests that the characteristic behaviors we've identified in account sums (sloping, jumping between two values, and fuzzily bouncing all over the place) are separable phenomena, although each is associated with a tablet move. 20180416T111943.000-0500.zip

[plot: sum of accounts over time]

aphyr commented Apr 17, 2018

You can now reproduce these problems significantly faster in Jepsen eb796cfcc204c592545965968bd28ad1e6b2eff0 by using the move-tablet nemesis, which shuffles tablets around every 15 seconds or so.

lein run test --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --time-limit 60 --concurrency 1n --nemesis move-tablet --workload bank --upsert-schema
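
(For reference, you can also trigger a move by hand through Zero's HTTP interface; assuming Zero's default HTTP port of 6080, something like the following asks it to move the amount predicate to group 2:)

curl "localhost:6080/moveTablet?tablet=amount&group=2"
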
janardhan1993 commented Apr 17, 2018

Some of the issues are fixed in #2339.

aphyr commented Apr 17, 2018

This is significantly improved in 95f3382, but still allows nil keys and incorrect totals.

manishrjain commented May 16, 2018

This might also be fixed by #2392. Testing it.

manishrjain commented May 17, 2018

I have to run the test for half an hour before this happens, but I see a nil key being added. I'm not sure if this is due to a predicate move or something else.

 {:valid? false,
   :read-count 21882,
   :error-count 1427,
   :first-error
   {:type :unexpected-key,
    :unexpected (nil),
    :op
    {:type :ok,
     :f :read,
     :process 52,
     :time 1153593764746,
     :value {nil 2, 0 23, 1 18, 2 17, 3 2, 5 24, 6 6, 7 10},
     :index 58951}},
aphyr commented May 17, 2018

Yeah, it looks like we're still seeing violations in the binary you sent over on 2018-05-16 (it doesn't log any version info, so I'm not exactly sure what code was in it). They're less frequent now! It took a couple of hours before I hit this case: 20180516T194304.000-0500.zip

In this case, we never observed nil keys, but it looks like we abruptly lost about half the keys in the database--keys 2, 3, 4, and 7 vanished.

{:type :invoke, :f :read, :process 7, :time 80842565886, :index 2017}
{:type :ok, :f :read, :process 5, :time 80997241014, :value {0 31, 1 15, 2 15, 3 7, 4 16, 5 2, 6 8, 7 6}, :index 2018}
{:type :invoke, :f :transfer, :value {:from 2, :to 1, :amount 3}, :process 5, :time 80997495741, :index 2019}
{:type :ok, :f :read, :process 7, :time 81175273547, :value {0 31, 1 15, 5 2, 6 8}, :index 2020}

It looks like these errors are persistent--they affect all nodes and are semi-stable over time.

[plot: sum of accounts over time]

manishrjain commented May 18, 2018

Testing this right now, let's see.

aphyr commented May 22, 2018

Testing the build you sent on Friday, we're still seeing divergence every ~5000 seconds with predicate moves. In 20180522T110600.000-0500.zip, the value diverged in what looks somewhat like a split-brain scenario:

[plot: sum of accounts over time]

You can reproduce this on Jepsen 789103bfd25f8635ed3410a38237003b0ae45da1 with

lein run test --local-binary ~/tmp/dgraph --force-download --nemesis move-tablet --sequencing server --upsert-schema --time-limit 300 --concurrency 1n --nemesis-interval 5 --workload bank --retry-db-setup --test-count 10
@aphyr aphyr changed the title SI violation with node crashes in bank tests SI violation with node crashes or predicate moves in bank tests May 22, 2018
@manishrjain manishrjain removed the verify-fix label May 23, 2018
manishrjain commented May 23, 2018

I'm trying out bank tests with predicate moves disabled. I want to determine whether this problem is only due to moves, or whether it happens without moves as well.

manishrjain commented Jan 18, 2019

Alright, it's been a long time! I've fixed all the txn violations in a cluster when tablet moves are not happening. That's already done.

For tablet moves, I've significantly simplified the design we had in place before, with this change: 63f8174

With this change, say tablet P moves from group X to group Y, and a write for P was made on the older group X. When the commit goes to Zero, Zero asserts that P now belongs to Y (so the write should have gone to Y) and rejects that transaction. This ensures that, during tablet moves, write transactions still provide ACID guarantees.
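
A minimal sketch of that commit-side check (hypothetical names and shapes; this is not Dgraph's actual code): Zero rejects a commit if any predicate written by the transaction is no longer served by the group the write went to.

    package main

    import "fmt"

    // tabletGroup is Zero's current view of which group serves each predicate.
    var tabletGroup = map[string]uint32{
        "amount": 2, // moved from group 1 to group 2
        "key":    1,
    }

    // commitAllowed rejects a commit whose write for pred landed on a group
    // that no longer serves that predicate.
    func commitAllowed(pred string, writtenToGroup uint32) error {
        if g, ok := tabletGroup[pred]; ok && g != writtenToGroup {
            return fmt.Errorf("aborting txn: %q is now served by group %d, but the write went to group %d",
                pred, g, writtenToGroup)
        }
        return nil
    }

    func main() {
        // A write to "amount" was accepted by the old group 1 while the tablet
        // moved to group 2, so Zero refuses to commit it.
        fmt.Println(commitAllowed("amount", 1))
    }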

However, this mechanism can still let a few reads go through to the older group X, which is holding stale data, hence causing some failures in Jepsen reads.

The reason is that membership information about tablets and the groups serving them flows from the Zero leader to the Alphas. This flow isn't transmitted deterministically via Raft; instead, it happens continuously over direct connections to Zero. Therefore, some Alpha could lag behind in learning that a certain tablet has moved from group X to group Y. In that case, that Alpha could still serve some reads directly. But any writes on that Alpha would be denied commit by Zero, as described above.

These few stale reads are what we see in the graphs below. But apart from them, the total does not permanently deviate from the intended 100.

[plots: sum of accounts over time across four test runs]

Any more checks to avoid such a read would only come at the cost of performance, without major benefits. Tablet moves are meant to be rare, not nearly as frequent as the Jepsen test makes them, so I think the mechanisms we have in place are sufficient for almost all users.

And if there are use cases that need such reads to be rejected, we could consider putting something stricter in place, at the cost of performance. But for now, I think this is good.

manishrjain added a commit that referenced this issue Feb 1, 2019
This PR introduces group checksums, which get propagated along the OracleDeltas streaming from Zero -> Alpha Leader -> Alpha Followers. This allows an Alpha to block a read if the group checksum does not match the checksum in membership, which is being received independently directly from Zero.

This fixes the Jepsen issue where during a tablet move, an Alpha with an older membership status (thinking it is serving the group) can reply to a query with stale data. Now, the Alpha would see that the checksum is different between Oracle Deltas and Membership state, and would block until it gets an updated state. Thus, it would realize that it is no longer serving the tablet, and either return an error or shoot the query off to the right Alpha.

Proposal retry issue:

This PR also fixes another issue where a commit proposal to the Zero leader gets blocked when the leader is partitioned away but has not yet stepped down. When the proposal times out and is internally retried, that Zero is no longer the leader, causing `n.AmLeader()` to fail and the commit to be rejected by Zero. This moves the needle of MaxAssigned, allowing a read to happen at Alpha. But the proposal that had timed out does get applied afterwards--causing a commit to happen at a lower timestamp than MaxAssigned.

This PR moves the `n.AmLeader` check to the top of the function, so when we are in the proposal try loop, we don't do this check. Instead, we'll keep trying to propose until we have a resolution from Raft.

Debugging:

This PR updates the `dgraph debug` tool to sum up the amounts without looking at the keys in Jepsen. This is useful when keys are located separately from amounts in two different groups. This change is what allowed me to find the first txn violation, and the causes of the tablet-move nemesis failures.

Fixes #2321 .

Commits:
* Introduce Group checksums, so an Alpha can know if the composition of the tablets in group changed.
* Add a simple waiting loop for group checksums.
* Remove any background cleaning jobs.
* Always iterate to create a posting list.
* Have a way to find violation linearly.
* Make reading amounts and total without considering keys.
* Avoid a race cond between proposing a commit in a loop v/s rejecting that commit when Zero stops being a leader.
* Self review. Do not do any tablet removals or deletions for now.
manishrjain commented Feb 1, 2019

Solved!

> However, this mechanism can still let a few reads go through to the older group X, which is holding stale data, hence causing some failures in Jepsen reads.

I introduced group checksums, which get propagated along with txn status information. Using those, any Alpha now knows when membership has changed at Zero but hasn't yet propagated to itself, and blocks replying until it does. This and some other rare edge cases are resolved in my PR above. This ticks off the last major official Jepsen issue.
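
A rough sketch of that blocking rule (hypothetical names; not the actual implementation): before answering a read, an Alpha waits until the group checksum it received along with the transaction status stream matches the checksum in its own membership state.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "sync/atomic"
        "time"
    )

    var (
        deltaChecksum      atomic.Uint64 // checksum received with txn status updates from Zero
        membershipChecksum atomic.Uint64 // checksum from the membership stream
    )

    // waitForChecksumMatch blocks a read until both views of the group's tablet
    // composition agree, or the context expires (in which case the client retries).
    func waitForChecksumMatch(ctx context.Context) error {
        for deltaChecksum.Load() != membershipChecksum.Load() {
            select {
            case <-ctx.Done():
                return errors.New("group membership out of date; retry the query")
            case <-time.After(10 * time.Millisecond):
            }
        }
        return nil
    }

    func main() {
        deltaChecksum.Store(42)
        membershipChecksum.Store(41) // this Alpha hasn't learned about the move yet

        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        defer cancel()
        if err := waitForChecksumMatch(ctx); err != nil {
            fmt.Println(err)
        }
    }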

I started looking at these Jepsen issues in May 2018--not full time, of course, but on and off. It needed an almost complete redesign of the entire flow of transactions and data in Dgraph. What a journey! This PR largely concludes the Dgraph v1.0 series (future versions will be v1.1).

