
Node crashes can lead to the loss of inserted triples #2290

Closed
aphyr opened this Issue Mar 31, 2018 · 13 comments

@aphyr

aphyr commented Mar 31, 2018

In a five-node cluster, with a mix of crashes and restarts of both Zero and Alpha nodes, it appears that the crash of multiple Alpha nodes can cause the loss of documents successfully inserted just prior to the crash. Moreover, nodes can disagree on the records that should be returned by a query for all records of a given type, and some nodes can have objects which match the type predicate but have a missing value for the value field. This is on version:

Dgraph version   : v1.0.4
Commit SHA-1     : 224b560
Commit timestamp : 2018-03-27 16:56:17 +0530
Branch           : fix/jepsen_delete

Our schema is

type: string @index(exact) .
value: int @index(int) .

We perform single mutations in transactions, inserting objects with type "element" and a unique integer value. In this test run, we lose three acknowledged inserts when Alpha crashes: 20180331T020118.000-0500.zip
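For concreteness, one such insert can be sketched as an RDF N-Quads mutation under the schema above. This is an illustration of the triples involved, not the test's actual client code; the blank-node name `_:e` is an assumption (Dgraph assigns the real UID at commit):

```python
def element_nquads(value: int) -> str:
    """Build N-Quads for one "element" insert under the schema above.

    `_:e` is an illustrative blank node; Dgraph replaces it with a real
    UID when the transaction commits.
    """
    return (f'_:e <type> "element" .\n'
            f'_:e <value> "{value}"^^<xs:int> .\n')

print(element_nquads(641))
```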

From Jepsen's log just prior to the crash (read "process 8 invoked an add of element 641"...):

8	:invoke	:add	641
2	:ok	:add	632
0	:ok	:add	635
6	:ok	:add	636
2	:invoke	:add	642
7	:ok	:add	634
5	:ok	:add	638
4	:ok	:add	637
3	:invoke	:add	643
6	:invoke	:add	644
1	:ok	:add	639
7	:invoke	:add	645
0	:invoke	:add	646
5	:invoke	:add	647
4	:invoke	:add	648
8	:ok	:add	641
7	:fail	:add	645	:unavailable
7	:invoke	:add	649
1	:invoke	:add	650
3	:ok	:add	643
6	:ok	:add	644
8	:invoke	:add	651
3	:invoke	:add	652
6	:invoke	:add	653
2	:fail	:add	642	:unavailable
:nemesis	:info	:kill-alpha	{"n3" :stopped, "n5" :stopped, "n1" :stopped, "n2" :stopped}
...
336	:ok	:read	(nil nil nil nil nil 0 1 2 3 ... 633 634 635 636 637 638 639 1620 1703 ...)
222	:ok	:read	(0 1 2 3 ... 635 636 637 638 639 1620 1703 ...)
331	:ok	:read	(nil nil nil nil nil 0 1 2 3  ... 633 634 635 636 637 638 639 1620 1703 )
70	:ok	:read	(0 1 2 3 ... 637 638 639 1620 1703 ...)
338	:ok	:read	(0 1 2 3 ... 635 636 637 638 639 1620 1703 ...)
333	:ok	:read	(0 1 2 3 ... 635 636 637 638 639 1620 1703 ...)
...
 {:valid? false,
  :lost "#{641 643..644}",
  :recovered "#{2404..2408}",
  :ok
  "#{0..639 1620 1703 1728..1737 1739..1751 1753..1760 1762..1769 1771..1787 1789..1828 1830..2408 3618 3625 3631..3638 3640..3655 3657..3664 3666..3675 3678..3683 3685..3698 3700..3707 3709..3713 3716..3744 3747 3749..3768 3770..3791 3793..3808 3810 3812..3847 3849..3857 3859..3880 3882..3896 3898..3907}",
  :recovered-frac 1/1008,
  :unexpected-frac 0,
  :unexpected "#{}",
  :lost-frac 1/1680,
  :ok-frac 5/16}
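The `:lost`/`:ok` sets above come from Jepsen's set checker. Its core idea can be sketched as follows; this is a simplification (the real checker also handles indeterminate and recovered operations), and the miniature history is hypothetical:

```python
def check_set(history):
    """Classify acknowledged adds against the final read.

    history: (process, type, f, value) tuples where type is ':invoke',
    ':ok', or ':fail' and f is ':add' or ':read'. Returns (ok, lost):
    acknowledged adds that did / did not appear in the last read.
    """
    acked = {v for _, typ, f, v in history
             if typ == ':ok' and f == ':add'}
    final_read = set()
    for _, typ, f, v in history:
        if typ == ':ok' and f == ':read':
            final_read = set(v)          # keep only the last successful read
    return acked & final_read, acked - final_read

# A miniature history in the shape of the log above:
history = [
    (8, ':invoke', ':add', 641),
    (2, ':ok',     ':add', 632),
    (8, ':ok',     ':add', 641),
    (7, ':fail',   ':add', 645),
    (70, ':ok',    ':read', [632, 635]),
]
ok, lost = check_set(history)
print(sorted(ok), sorted(lost))  # prints: [632] [641]
```

Here 641 was acknowledged (`:ok :add`) but absent from the final read, so it lands in `:lost`; the failed 645 is never counted against the database.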

Note that there are three missing inserts (641, 643, 644), which occurred just prior to the crash of n1, n2, n3, and n5. There are also five nil values matching our query for all records with eq(type, "element"), which suggests that perhaps UIDs were created for these objects (and perhaps for two others which timed out or failed?), but those UIDs weren't linked to the :value fields correctly?

Note that process numbers (the first column) are striped over nodes, so processes 0, 5, 10, ... talk to node n1. From this we can infer that those value-less objects are present on one node (n2), whereas n1, n3, and n4 don't show them, which is also odd! n5 did not respond to either of its 2 clients for its final read.
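The striping described above can be written down directly (assuming the five nodes are n1..n5, matching this cluster):

```python
def node_for_process(p: int) -> str:
    """Map a Jepsen process number to the node it talks to: round-robin
    over a five-node cluster (n1..n5 assumed from this test setup)."""
    nodes = ("n1", "n2", "n3", "n4", "n5")
    return nodes[p % len(nodes)]

# Processes 0, 5, 10, ... all talk to n1; process 1 talks to n2, etc.
print(node_for_process(0), node_for_process(5), node_for_process(1))  # n1 n1 n2
```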

You can reproduce this with Jepsen fd875fa9f3c248f6715e1056a504392bfa3963d3 by running:

lein run test --package-url https://transfer.sh/10WC2g/dgraph-linux-amd64.tar.gz --force-download -w set --time-limit 300 --concurrency 10 --test-count 10

@manishrjain manishrjain added the bug label Mar 31, 2018

@manishrjain manishrjain added this to the Sprint-001 milestone Mar 31, 2018

@janardhan1993

Contributor

janardhan1993 commented Apr 3, 2018

@aphyr Is the same client being used after the restart here?

@aphyr

aphyr commented Apr 3, 2018

Yes, we continue to use the same clients through restart, at least until the client gives us an indeterminate error, at which point we generate a new one.

@janardhan1993 janardhan1993 self-assigned this Apr 3, 2018

@janardhan1993

Contributor

janardhan1993 commented Apr 9, 2018

This can happen when using client-side sequential mode. If the server crashes before the data is persisted to disk, on restart a new client can read the data before the WAL is replayed.

@janardhan1993

Contributor

janardhan1993 commented Apr 9, 2018

It's not a loss, but you would see it eventually with client-side sequential mode. If we use server-side sequential mode, then the read would wait until everything is replayed.
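The race described here can be modelled with a toy store (a sketch, not Dgraph's actual implementation): committed writes live in a durable WAL, a crash loses the un-applied in-memory state, and a read either waits for WAL replay (server-side sequencing) or does not (client-side sequencing):

```python
class ToyStore:
    """Toy model of the post-crash visibility race (not Dgraph's code).

    Writes are appended to a durable WAL; applying them to the persisted
    state happens later. A crash wipes the un-applied in-memory state, so
    recovery must replay the WAL before the data is visible again.
    """
    def __init__(self):
        self.wal = []        # durable log of committed (acknowledged) writes
        self.state = set()   # persisted (snapshotted) data
        self.applied = 0     # number of WAL entries already in self.state

    def write(self, value):
        self.wal.append(value)   # committed: acknowledged to the client

    def crash_and_restart(self):
        pass  # WAL and snapshot survive; un-replayed entries are not in state

    def replay_wal(self):
        for v in self.wal[self.applied:]:
            self.state.add(v)
        self.applied = len(self.wal)

    def read(self, server_side_sequencing):
        if server_side_sequencing:
            self.replay_wal()    # block the read until replay finishes
        return set(self.state)

store = ToyStore()
store.write(641)
store.crash_and_restart()
print(store.read(server_side_sequencing=False))  # set() -- 641 not yet visible
print(store.read(server_side_sequencing=True))   # {641} -- read waits for replay
```

In this model the write is never lost, matching the comment above: a client-side-sequenced read merely observes the store before replay completes.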

@aphyr

aphyr commented Apr 9, 2018

Ahh, okay. Do you know how long you'd have to wait for recovery to complete?

@janardhan1993

Contributor

janardhan1993 commented Apr 9, 2018

Depends on when the snapshot happened and how many entries need to be replayed.

@aphyr

aphyr commented Apr 9, 2018

It looks like it might be hundreds of times longer than that?

[latency-raw graph]

In this test I allow two minutes for the first recovery, but some records are still missing until a thousand seconds later, when other nodes finally come back online. And there's still that weird issue with returning records for an index query that don't have any associated value--that feels like it's definitely a consistency violation. Is that expected behavior?

@janardhan1993 janardhan1993 reopened this Apr 9, 2018

@janardhan1993

Contributor

janardhan1993 commented Apr 9, 2018

@aphyr It's not expected behaviour. Are you using the latest nightly build?

@aphyr

aphyr commented Apr 9, 2018

I'm looking at 1.0.4--the nightlies were deadlocking in this test on Friday, but that may be fixed now!

@janardhan1993

Contributor

janardhan1993 commented Apr 9, 2018

Server-side sequential mode is not present in 1.0.4. Can you please try with this build?
https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz

It has the fix for the deadlocking.

@aphyr

aphyr commented Apr 9, 2018

So, uh, Z5CTJ with server-side sequencing did this incredibly interesting thing:
20180409T115148.000-0500.zip

[latency-raw graph]

After a series of Alpha and Zero crashes and restarts, it took about 400 seconds to recover, then failed to return all but one of the 29662 acknowledged inserts. Instead, it gave a sequence of a whole boatload of value-less responses (that's the nils in the final reads), followed by the most recently inserted element:

190 :ok :read (nil nil nil nil ... nil nil nil 133143)

You can reproduce this with Jepsen 0ef6e711dfb07aad4afc84f7f9c3348961afa9d7:

lein run test --package-url https://transfer.sh/Z5CTJ/dgraph-linux-amd64.tar.gz --time-limit 300 --concurrency 100 --nemesis kill-alpha,fix-alpha,kill-zero --test-count 20 --workload set --final-recovery-time 200

Since we only insert records with a numeric :value, we should never be able to observe a record without one--yet here Dgraph's doing that for tens of thousands of records. Something's wonky here...

@aphyr

aphyr commented Apr 9, 2018

To make sure this wasn't something to do with the recovery process itself--like maybe servers could only expose these results immediately after coming back online--I ran another test with a 2000-second pause before we perform any read queries, and it exhibited the same losing-all-but-one-write behavior:

[latency-raw graph]

 {:valid? false,
  :lost
  "#{0..16982 16984..16987 16989..16999 17001 ... 63839..63840 63844}",
  :recovered "#{63845}",
  :ok "#{63845}",
  :recovered-frac 1/65624,
  :unexpected-frac 1/65624,
  :unexpected "#{nil}",
  :lost-frac 39115/65624,
  :ok-frac 1/65624},
 :valid? false}

@janardhan1993

Contributor

janardhan1993 commented Apr 16, 2018

The test no longer fails with:

lein run test --package-url https://transfer.sh/mOBTj/dgraph-linux-amd64.tar.gz --time-limit 300 --concurrency 10 --nemesis kill-alpha,fix-alpha,kill-zero --test-count 20 --workload set
